Cars Analysis

Carolina López De La Madriz

UC3M 2022

Introduction

This project consists of the analysis of a dataset of used cars and their corresponding prices, along with other features of the cars. More precisely, we will classify cars according to the amount of pollution they generate, and we will predict the prices of these used cars by building various Machine Learning models.

With this analysis, we have the opportunity to see how cars are related to pollution, depending on their fuel index, fuel type, etc. In addition, we will get an idea of how much a used car costs depending on its model, category, manufacturer, mileage, engine volume, and so on.

This study is based on the data from a hiring competition.

Data Pre-processing

rm(list=ls())
setwd("~/Desktop/Statistical Learning/HW 2")

# LIBRARIES
library(ggplot2)
library(plotly)
library(dplyr)
library(tidyverse)
library(MASS)
library(caret)
library(e1071) 
library(GGally)
library(rpart)
library(rpart.plot)
library(randomForest)
library(leaps)
library(olsrr)
library(glmnet)
library(pROC)

In this section, we will first explore the data. The data set contains 19237 observations and 18 variables. Each row corresponds to a used car, although we will see that some of them may not have been used before (as they have 0 mileage).

The columns of the data correspond to the following:

  1. ID: identification number for each car.

  2. Price: price of the car.

  3. Levy: the motor vehicle levy helps cover the cost of accidents on public roads involving moving vehicles. This variable indicates the amount of levy paid for each car.

  4. Manufacturer: business entity that has produced the automobile.

  5. Model: car design of the manufacturer.

  6. Prod.year: year of production.

  7. Category: type of car.

  8. Leather.interior: whether a car has leather in the interior.

  9. Fuel.type: car fuel type (e.g. petrol or diesel).

  10. Engine.volume: total volume of the cylinders in the engine.

  11. Mileage: km that the car has traveled

  12. Cylinders: number of cylinders of the car (vital part of the engine where fuel is combusted and power is generated)

  13. Gear.box.type: type of transmission (e.g. automatic or manual)

  14. Drive.wheels: type of drive of the wheels

  15. Doors: number of doors

  16. Wheel: position of the steering wheel (left wheel or right-hand drive)

  17. Color: color of the car

  18. Airbags: number of airbags available in the car

We pre-process the data with the two main goals of the project in mind: car price prediction, and classification of a new variable, to be created, related to the amount of pollution a car generates. First, let us take a look at the data structure and its composition.

# read the csv file
data = read.csv("Cars.csv")

head(data)
##         ID Price Levy Manufacturer    Model Prod..year  Category
## 1 45654403 13328 1399        LEXUS   RX 450         NA      Jeep
## 2 44731507 16621 1018    CHEVROLET  Equinox       2011      Jeep
## 3 45774419  8467    -        HONDA      FIT       2006 Hatchback
## 4 45769185  3607  862         FORD   Escape       2011      Jeep
## 5 45809263 11726  446        HONDA      FIT       2014 Hatchback
## 6 45802912 39493  891      HYUNDAI Santa FE       2016      Jeep
##   Leather.interior Fuel.type Engine.volume   Mileage Cylinders Gear.box.type
## 1              Yes    Hybrid           3.5 186005 km         6     Automatic
## 2               No    Petrol             3 192000 km         6     Tiptronic
## 3               No    Petrol           1.3 200000 km         4      Variator
## 4              Yes    Hybrid           2.5 168966 km         4     Automatic
## 5              Yes    Petrol           1.3  91901 km         4     Automatic
## 6              Yes    Diesel             2 160931 km         4     Automatic
##   Drive.wheels  Doors            Wheel  Color Airbags
## 1          4x4 04-May       Left wheel Silver      12
## 2          4x4 04-May       Left wheel  Black       8
## 3        Front 04-May Right-hand drive  Black       2
## 4          4x4 04-May       Left wheel  White       0
## 5        Front 04-May       Left wheel Silver       4
## 6        Front 04-May       Left wheel  White       4
# structure
str(data)
## 'data.frame':    19237 obs. of  18 variables:
##  $ ID              : int  45654403 44731507 45774419 45769185 45809263 45802912 45656768 45816158 45641395 45756839 ...
##  $ Price           : int  13328 16621 8467 3607 11726 39493 1803 549 1098 26657 ...
##  $ Levy            : chr  "1399" "1018" "-" "862" ...
##  $ Manufacturer    : chr  "LEXUS" "CHEVROLET" "HONDA" "FORD" ...
##  $ Model           : chr  "RX 450" "Equinox" "FIT" "Escape" ...
##  $ Prod..year      : int  NA 2011 2006 2011 2014 2016 2010 2013 2014 2007 ...
##  $ Category        : chr  "Jeep" "Jeep" "Hatchback" "Jeep" ...
##  $ Leather.interior: chr  "Yes" "No" "No" "Yes" ...
##  $ Fuel.type       : chr  "Hybrid" "Petrol" "Petrol" "Hybrid" ...
##  $ Engine.volume   : chr  "3.5" "3" "1.3" "2.5" ...
##  $ Mileage         : chr  "186005 km" "192000 km" "200000 km" "168966 km" ...
##  $ Cylinders       : int  6 6 4 4 4 4 4 4 4 6 ...
##  $ Gear.box.type   : chr  "Automatic" "Tiptronic" "Variator" "Automatic" ...
##  $ Drive.wheels    : chr  "4x4" "4x4" "Front" "4x4" ...
##  $ Doors           : chr  "04-May" "04-May" "04-May" "04-May" ...
##  $ Wheel           : chr  "Left wheel" "Left wheel" "Right-hand drive" "Left wheel" ...
##  $ Color           : chr  "Silver" "Black" "Black" "White" ...
##  $ Airbags         : int  12 8 2 0 4 4 12 12 12 12 ...
summary(data)
##        ID               Price              Levy           Manufacturer      
##  Min.   :20746880   Min.   :       1   Length:19237       Length:19237      
##  1st Qu.:45698355   1st Qu.:    5331   Class :character   Class :character  
##  Median :45772306   Median :   13172   Mode  :character   Mode  :character  
##  Mean   :45576555   Mean   :   18556                                        
##  3rd Qu.:45802035   3rd Qu.:   22075                                        
##  Max.   :45816654   Max.   :26307500                                        
##  NA's   :20                                                                 
##     Model             Prod..year     Category         Leather.interior  
##  Length:19237       Min.   :1939   Length:19237       Length:19237      
##  Class :character   1st Qu.:2009   Class :character   Class :character  
##  Mode  :character   Median :2012   Mode  :character   Mode  :character  
##                     Mean   :2011                                        
##                     3rd Qu.:2015                                        
##                     Max.   :2020                                        
##                     NA's   :10                                          
##   Fuel.type         Engine.volume        Mileage            Cylinders     
##  Length:19237       Length:19237       Length:19237       Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 4.000  
##                                                           Mean   : 4.583  
##                                                           3rd Qu.: 4.000  
##                                                           Max.   :16.000  
##                                                           NA's   :20      
##  Gear.box.type      Drive.wheels          Doors              Wheel          
##  Length:19237       Length:19237       Length:19237       Length:19237      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Color              Airbags      
##  Length:19237       Min.   : 0.000  
##  Class :character   1st Qu.: 4.000  
##  Mode  :character   Median : 6.000  
##                     Mean   : 6.581  
##                     3rd Qu.:12.000  
##                     Max.   :16.000  
##                     NA's   :24
# 19237 observations and 18 variables 
# 5 numeric columns (integer) and 13 categorical columns (characters)

As it has been stated, there are 19237 observations and 18 variables, where we can find 5 numeric columns and 13 categorical columns.

Before the analysis, we must prepare the input.

Adjustment and creation of variables

The name of the production year column (Prod..year) is not uniform with the names of the other columns of the data set, so I will rename it.

colnames(data)[6] = "Prod.year"

The Mileage column gives how many km the car has been driven; “km” is written in the column after each reading. I will remove it and convert the column to a numeric variable for easier use.

data$Mileage = gsub('km', '', data$Mileage)
data$Mileage = as.numeric(data$Mileage)

A car that has 0 km traveled is the same as a brand new car, as it hasn’t been used before. Therefore, we will create a new variable (New) to know whether a car is new or not.

# a car with 0 km means it is brand new
# NAs in Mileage do not mean 0 km, so they are treated as not brand new
# 1 -> brand new
# 0 -> used
data$New = ifelse(!is.na(data$Mileage) & data$Mileage == 0, 1, 0)
sum(data$New == 1) / 19237 # proportion of brand new cars
## [1] 0.03737589

We can see that there are not many brand new cars.

In the Engine.volume column, we see that some cars also have their type of engine written, that is, turbo or not turbo. We will create a new column indicating the engine type and then remove the “Turbo” part of this variable.

# grepl tests whether a pattern appears in a string
# 1 -> turbo
# 0 -> no turbo
data$Turbo = as.numeric(grepl("Turbo", data$Engine.volume))

# proportion of turbo cars
sum(data$Turbo == 1) / 19237 # about 10% of the cars are turbo
## [1] 0.1002235
# now we can remove the Turbo part of Engine.volume and convert the variable to numeric
data$Engine.volume = gsub('Turbo', '', data$Engine.volume)
data$Engine.volume = as.numeric(data$Engine.volume)

Regarding the Doors column, we will make some changes to its values. “04-May” will be simplified to 4, as it just means that the car has 4 doors; likewise, “02-Mar” corresponds to 2 doors. The remaining cars have more than 5 doors (for instance, a minibus).

# Doors column
unique(data$Doors)
## [1] "04-May" "02-Mar" ">5"
sum(is.na(data$Doors)) #check just in case
## [1] 0
# "04-May" just means 4 doors
# "02-Mar" just means 2 doors
data$Doors[data$Doors == "04-May"] = "4"
data$Doors[data$Doors == "02-Mar"] = "2"

# it cannot be converted to a numeric variable because >5 is undefined

Handling missing values

# checking for missing data
sum(is.na(data))
## [1] 102
colSums(is.na(data))
##               ID            Price             Levy     Manufacturer 
##               20                0                0                0 
##            Model        Prod.year         Category Leather.interior 
##                0               10                0                0 
##        Fuel.type    Engine.volume          Mileage        Cylinders 
##                0               12               16               20 
##    Gear.box.type     Drive.wheels            Doors            Wheel 
##                0                0                0                0 
##            Color          Airbags              New            Turbo 
##                0               24                0                0

We can see that there are NAs in some numeric columns: ID, Prod.year, Engine.volume, Mileage, Cylinders and Airbags.

NAs in ID are not really significant, as ID doesn’t provide relevant information for this analysis. In fact, I am going to remove this column, not because it contains NAs but because it contributes nothing to the goal of the project: it provides no information about car prices or car pollution.

data = data[, -1]

NAs in Production year

We have just 10 cars with no information about their year of production. In this case, we can handle each of them by assigning, based on the car’s model, the average production year of the cars of that model.

nas_year = which(is.na(data$Prod.year))
# returns the index of the row with the NA value in Prod.year

# for each NA, I will compute the average Prod.year of all the cars in the data set
# with the same model, and assign that value to the missing entry

for(car in nas_year){
  data$Prod.year[car] = round(mean(data$Prod.year[data$Model == data$Model[car]], na.rm = TRUE))
}

colSums(is.na(data))
##            Price             Levy     Manufacturer            Model 
##                0                0                0                0 
##        Prod.year         Category Leather.interior        Fuel.type 
##                0                0                0                0 
##    Engine.volume          Mileage        Cylinders    Gear.box.type 
##               12               16               20                0 
##     Drive.wheels            Doors            Wheel            Color 
##                0                0                0                0 
##          Airbags              New            Turbo 
##               24                0                0
# successfully handled

NAs in Engine Volume

We have 12 missing values. The engine volume of a car refers to the volume of fuel and air that can be pushed through the car’s cylinders (expressed here in litres). Since it is closely related to the number of cylinders, we can impute the NAs using the Cylinders column (together with the model).

nas_eng = which(is.na(data$Engine.volume))

# for each NA, assign the average engine volume of cars with the same number of cylinders and model
for(car in nas_eng){
  data$Engine.volume[car] = round(mean(data$Engine.volume[(data$Cylinders == data$Cylinders[car]) & (data$Model == data$Model[car])], na.rm = TRUE))
}

colSums(is.na(data))
##            Price             Levy     Manufacturer            Model 
##                0                0                0                0 
##        Prod.year         Category Leather.interior        Fuel.type 
##                0                0                0                0 
##    Engine.volume          Mileage        Cylinders    Gear.box.type 
##                1               16               20                0 
##     Drive.wheels            Doors            Wheel            Color 
##                0                0                0                0 
##          Airbags              New            Turbo 
##               24                0                0
# all NAs have been successfully handled except for one

which(is.na(data$Engine.volume))
## [1] 9032
data$Cylinders[which(is.na(data$Engine.volume))]
## [1] NA
# this particular car has an NA in Cylinders as well, which is why
# it couldn't be handled by the previous imputation
# for this case, I will estimate the NA considering only the model of the car
idx = which(is.na(data$Engine.volume))
data$Engine.volume[idx] = round(mean(data$Engine.volume[data$Model == data$Model[idx]], na.rm = TRUE))

colSums(is.na(data))
##            Price             Levy     Manufacturer            Model 
##                0                0                0                0 
##        Prod.year         Category Leather.interior        Fuel.type 
##                0                0                0                0 
##    Engine.volume          Mileage        Cylinders    Gear.box.type 
##                0               16               20                0 
##     Drive.wheels            Doors            Wheel            Color 
##                0                0                0                0 
##          Airbags              New            Turbo 
##               24                0                0
# now it's ok

NAs in Mileage

We have 16 missing values. It is difficult to estimate the mileage of a car, but I will base the estimation on the average mileage of cars with the same production year.

nas_km = which(is.na(data$Mileage)) 
# returns the index of the row with the NA value in Mileage

# for each NA, assign the avg number of km of the cars with the same production year
for(car in nas_km){
  data$Mileage[car] = round(mean(data$Mileage[data$Prod.year == data$Prod.year[car]], na.rm = TRUE))
}

colSums(is.na(data))
##            Price             Levy     Manufacturer            Model 
##                0                0                0                0 
##        Prod.year         Category Leather.interior        Fuel.type 
##                0                0                0                0 
##    Engine.volume          Mileage        Cylinders    Gear.box.type 
##                0                0               20                0 
##     Drive.wheels            Doors            Wheel            Color 
##                0                0                0                0 
##          Airbags              New            Turbo 
##               24                0                0
# successfully handled

NAs in Cylinders

Cylinders are the power unit of an engine. We have 20 missing values, which can be handled using the Engine.volume column.

nas_cyl = which(is.na(data$Cylinders))

for(car in nas_cyl){
  data$Cylinders[car] = round(mean(data$Cylinders[data$Engine.volume == data$Engine.volume[car]], na.rm = TRUE))
}

colSums(is.na(data))
##            Price             Levy     Manufacturer            Model 
##                0                0                0                0 
##        Prod.year         Category Leather.interior        Fuel.type 
##                0                0                0                0 
##    Engine.volume          Mileage        Cylinders    Gear.box.type 
##                0                0                0                0 
##     Drive.wheels            Doors            Wheel            Color 
##                0                0                0                0 
##          Airbags              New            Turbo 
##               24                0                0
# successfully handled

NAs in Airbags

We have 24 missing values. The airbag, something standard today, was patented in 1971 by Mercedes-Benz, but it was not installed in a car until 1981, and it was not until the 90s that it began to spread across all kinds of makes and models. Since 2006, it has been mandatory for all cars to have at least two airbags in the front part of the car.

We will take this regulation into account to handle the cars with NAs or inconsistent airbag counts, as it makes no sense for some cars to have had airbags before they were even installed in cars (this is likely some kind of error).

  • For all cars with Prod.year < 1981, the number of airbags is set to 0

  • For all cars with 1981 <= Prod.year < 2006, NAs are set to 0

  • For all cars with Prod.year >= 2006, NAs are set to 2 (the mandatory minimum)

for (i in 1:nrow(data)){
  if(data[i,5] < 1981){ # Prod.year < 1981
    data[i,17] = 0 # set airbags to 0
  }else if((data[i, 5] < 2006) & is.na(data[i, 17])){ # Prod.year between 1981 and 2006 and NA in airbag
    data[i, 17] = 0
  }else if(is.na(data[i, 17])){ # Prod.year > 2006 and NA in airbag
    data[i, 17] = 2
  }
}

colSums(is.na(data))
##            Price             Levy     Manufacturer            Model 
##                0                0                0                0 
##        Prod.year         Category Leather.interior        Fuel.type 
##                0                0                0                0 
##    Engine.volume          Mileage        Cylinders    Gear.box.type 
##                0                0                0                0 
##     Drive.wheels            Doors            Wheel            Color 
##                0                0                0                0 
##          Airbags              New            Turbo 
##                0                0                0
# successfully handled

Thus, cars produced in 2006 or later with a missing airbag count are assigned 2 airbags, as that is the minimum number they must have by law.

NAs in Levy

After analyzing the Levy column, I found that it does contain missing values, but they were given as ‘-’ in the data, which is why we were not able to detect them earlier. We could impute the ‘-’ entries with 0, assuming there was no levy, or with the mean or median. However, we cannot set these values to 0, as there is a mandatory third-party insurance for every car (“Seguro obligatorio a terceros”) equal to 87 (which is indeed the minimum levy observed for the rest of the cars). Therefore, every ‘-’ value will be set to 87.

data$Levy[data$Levy == "-"] = "87"

data$Levy = as.numeric(data$Levy)

Creating more variables

Now we will create the Pollution variable that will be later used for the classification part. This variable will be categorical, with values “HIGH”, “MEDIUM” or “LOW” depending on the pollution generated by the car.

In order to determine which group a car belongs to, we will use a formula that generates an index; each car will then be assigned to a group depending on the interval this index falls into. For this, we will also create a variable containing the fuel index of the car, based on the CO2 emissions of its fuel type.

After making a research on automobile sites, I have come up with this formula:

Pollution index = Fuel index * ( Engine volume + Cylinders )
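As a quick sanity check of this formula, consider a hypothetical 2.0-litre, 4-cylinder petrol car (the figures here are illustrative; the petrol coefficient of 2.3 / 2.4 is derived below from the CO2 emissions table):

```r
# hypothetical example: a 2.0-litre, 4-cylinder petrol car
# fuel index for petrol = CO2 emissions of petrol relative to diesel
fuel_idx = round(2.3 / 2.4, 2)          # 0.96
pollution_index = fuel_idx * (2.0 + 4)  # fuel index * (engine volume + cylinders)
pollution_index
## [1] 5.76
```

A car like this sits right in the middle of the pollution range (the summary of the Pollution column computed later shows a median of 5.760).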

unique(data$Fuel.type)
## [1] "Hybrid"         "Petrol"         "Diesel"         "CNG"           
## [5] "Plug-in Hybrid" "LPG"            "Hydrogen"

The fuel coefficient is based on the CO2 emissions of each fuel type:

  Fuel type                       CO2 emissions (kg per liter)
  Diesel                          2.4
  Petrol                          2.3
  LPG (Liquefied Petroleum Gas)   1.51
  CNG (Compressed Natural Gas)    1.39
  Hybrid                          0.7
  Plug-in Hybrid                  0.5
  Hydrogen                        0.1

Taking diesel as the most pollutant fuel, we can compute the coefficients for the rest of the fuels relative to it and store them in a new column, depending on each car’s fuel type.

# establishing the CO2 emissions
co2_emission_diesel = 2.4
co2_emission_petrol = 2.3
co2_emission_lpg = 1.51
co2_emission_cng = 1.39
co2_emission_hybrid = 0.7
co2_emission_plug_in = 0.5
co2_emission_hydrogen = 0.1

# computing the coefficients
diesel = round(co2_emission_diesel / co2_emission_diesel, 2)
petrol = round(co2_emission_petrol / co2_emission_diesel, 2)
lpg = round(co2_emission_lpg / co2_emission_diesel, 2)
cng = round(co2_emission_cng / co2_emission_diesel, 2)
hybrid = round(co2_emission_hybrid / co2_emission_diesel, 2)
plug_in = round(co2_emission_plug_in / co2_emission_diesel, 2)
hydrogen = round(co2_emission_hydrogen / co2_emission_diesel, 2)

# column containing the fuel index of each car's fuel type, by means of pollution
coefs = c("Diesel" = diesel, "Petrol" = petrol, "LPG" = lpg, "CNG" = cng,
          "Hybrid" = hybrid, "Plug-in Hybrid" = plug_in, "Hydrogen" = hydrogen)
data$Fuel.idx = unname(coefs[data$Fuel.type])

Now we create the Pollution variable. This column will contain the result of applying the previous formula to each car, which we will use to classify the vehicles by how pollutant they are.

# pollution index = fuel index * (engine volume + cylinders)
data$Pollution = data$Fuel.idx * (data$Engine.volume + data$Cylinders)
summary(data$Pollution)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.336   4.032   5.760   5.689   6.300  18.720

Right now Pollution contains the numeric result of the formula. Cars will now be classified as:

  • HIGH: pollution > 6.5

  • MEDIUM: 4 <= pollution <= 6.5

  • LOW: pollution < 4

data$Pollution = ifelse(data$Pollution > 6.5, "HIGH",
                 ifelse(data$Pollution < 4, "LOW", "MEDIUM"))

Price is going to be the target column/dependent feature for the regression part of the project.

Pollution will be the target column for the classification part.

Outliers

Studying the outliers of the numeric variables by IQR:
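Each subsection below inlines the same quartile-and-fence computation. As a compact sketch (a hypothetical helper, not used in the rest of the pipeline), the rule can be wrapped in a small function that returns the indices of the flagged observations:

```r
# indices of the observations outside the k*IQR fences (k = 1.5 by default)
iqr_outliers = function(x, k = 1.5) {
  qi = unname(quantile(x, 0.25))
  qs = unname(quantile(x, 0.75))
  iqr = qs - qi
  which(x < qi - k * iqr | x > qs + k * iqr)
}

iqr_outliers(c(1:10, 100))
## [1] 11
```

Applied to a column such as data$Price, it would reproduce the per-variable counts obtained below.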

Outliers in Price

QI = quantile(data$Price, 0.25)
QS = quantile(data$Price, 0.75)
IQR = QS-QI

names(QI) = NULL
names(QS) = NULL

sum(data$Price < QI - 1.5*IQR | data$Price > QS + 1.5*IQR)
## [1] 1073
ggplot(data = data, mapping = aes(y = Price)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 4) +
  ggtitle(expression(atop("Boxplot of Price", atop(italic("In Red Outliers"))))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

outliers = which(data$Price < (QI - 1.5*IQR) | data$Price > (QS + 1.5*IQR))

# first few outlier prices
prices = data$Price[head(outliers)]
prices
## [1] 59464 51746 55390 87112 53154 77775

As we can see, the outliers of the Price variable are very large numbers. Looking at the data, we also find very small prices, which at first may not seem to make much sense. However, this is possible: in some cases it is cheaper to sell a car (usually an old one) at a very low price than to take it to the scrapyard, and some of these cars may not work well and need to be fixed. We should also remember that this data set is mainly of used cars. For these reasons, cars with low prices make sense, and indeed these cars tend to have very high mileage, which supports this idea.

However, in order to make the comparison and analysis of prices reasonable, I will remove all cars produced before 2000. From my point of view, it doesn’t make sense to compare prices within such a large time interval (1939-2020), because of factors such as inflation and currency changes.

# positions of all the cars produced before 2000
not21 = which(data$Prod.year < 2000)

length(not21) # 982 cars, around 5% of the data
## [1] 982
data = data[-not21, ]

This change doesn’t substantially affect our data, as cars from those years make up only around 5% of the data set. Now that we are considering only cars of the 21st century, we can safely compute the outliers in Price.

QI = quantile(data$Price, 0.25)
QS = quantile(data$Price, 0.75)
IQR = QS-QI

names(QI) = NULL
names(QS) = NULL

sum(data$Price < QI - 1.5*IQR | data$Price > QS + 1.5*IQR)
## [1] 1014
ggplot(data = data, mapping = aes(y = Price)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 4) +
  ggtitle(expression(atop("Boxplot of Price", atop(italic("In Red Outliers"))))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

From the boxplot, we see that many cars have a low price, but this is certainly possible, as our data set is not just of new cars and covers a very wide range of car categories.

Let us remove the outliers.

outliers = which(data$Price < (QI - 1.5*IQR) | data$Price > (QS + 1.5*IQR))
data = data[-outliers, ]

Outliers in Levy

QI = quantile(data$Levy, 0.25)
QS = quantile(data$Levy, 0.75)
IQR = QS-QI

sum(data$Levy < QI - 1.5*IQR | data$Levy > QS + 1.5*IQR)
## [1] 189
ggplot(data = data, mapping = aes(y = Levy)) +
  geom_boxplot(fill = "slategray1", outlier.color = "red", outlier.shape = 4) +
  ggtitle(expression(atop("Boxplot of Levy", atop(italic("In Red Outliers"))))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

We can see that there are several outliers that we can safely remove.

outliers = which(data$Levy < (QI - 1.5*IQR) | data$Levy > (QS + 1.5*IQR))
data = data[-outliers, ]

Outliers in Year of Production

ggplot(data = data, mapping = aes(y = Prod.year)) +
  geom_boxplot(fill = "darkolivegreen1", outlier.color = "red", outlier.shape = 4) +
  ggtitle(expression(atop("Boxplot of Production year", atop(italic("In Red Outliers"))))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

For this variable, we see there are several outliers, but it is not necessary to remove them, because we want to consider all the cars of the 21st century. We may have more cars from 2010 onwards, but that doesn’t mean we don’t want to keep the ones from years such as 2000, because they are still part of our analysis.

Outliers in Engine Volume

QI = quantile(data$Engine.volume, 0.25)
QS = quantile(data$Engine.volume, 0.75)
IQR = QS-QI

sum(data$Engine.volume < QI - 1.5*IQR | data$Engine.volume > QS + 1.5*IQR)
## [1] 987
ggplot(data = data, mapping = aes(y = Engine.volume)) +
  geom_boxplot(fill = "thistle1", outlier.color = "red", outlier.shape = 4) +
  ggtitle(expression(atop("Boxplot of Engine Volume", atop(italic("In Red Outliers"))))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

From the boxplot we see that there are several outliers, which may influence the classification or prediction parts of the analysis, as this variable seems relevant for a car (judging from its definition; we will see later whether this is true).

outliers = which(data$Engine.volume < (QI - 1.5*IQR) | data$Engine.volume > (QS + 1.5*IQR))
data = data[-outliers, ]

Outliers in Mileage

QI = quantile(data$Mileage, 0.25)
QS = quantile(data$Mileage, 0.75)
IQR = QS-QI

sum(data$Mileage < QI - 1.5*IQR | data$Mileage > QS + 1.5*IQR)
## [1] 623
ggplot(data = data, mapping = aes(y = Mileage)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 4) +
  ggtitle(expression(atop("Boxplot of Mileage", atop(italic("In Red Outliers"))))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

From the boxplot, we can observe some outliers. We should take into account that this variable depends heavily on the owner of the car and their preferences (someone who loves to travel will use the car frequently). Therefore, I will keep these outliers, as the characteristics of cars that have been driven very long distances, and how they affect the price variable, could help the analysis.

Outliers in Cylinders

Let us compute them by the 3-sigma rule.

mu = mean(data$Cylinders)
sigma = sd(data$Cylinders)

sum(data$Cylinders < mu - 3*sigma | data$Cylinders > mu + 3*sigma)
## [1] 102

Cylinders’ outliers will not be removed, as we have already removed the Engine Volume outliers, a variable that is related to cylinders (the engine volume of a car depends on its cylinders). At the same time, it may be interesting for the analysis to see how the price and the pollution of a car are affected by these outliers. Indeed, they may contribute to a better analysis.
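The 3-sigma rule used above can be wrapped in a small reusable helper. This is only an illustrative sketch on toy data, not part of the original pre-processing; the function name is made up.

```r
# Indices of observations lying more than k standard deviations
# from the mean (the 3-sigma rule for the default k = 3).
sigma_outliers <- function(x, k = 3) {
  mu <- mean(x)
  s  <- sd(x)
  which(x < mu - k * s | x > mu + k * s)
}

# Toy example: 200 standard normal values plus one extreme value.
set.seed(2)
x <- c(rnorm(200), 8)
sigma_outliers(x)   # flags the extreme value at position 201
```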

Outliers in Airbags

QI = quantile(data$Airbags, 0.25)
QS = quantile(data$Airbags, 0.75)
IQR = QS-QI

sum(data$Airbags < QI - 1.5*IQR | data$Airbags > QS + 1.5*IQR)
## [1] 0
ggplot(data = data, mapping = aes(y = Airbags)) +
  geom_boxplot(fill = "mediumslateblue", outlier.color = "red", outlier.shape = 4) +
  ggtitle(expression(atop("Boxplot of Airbags", atop(italic("In Red Outliers"))))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

From the plot, we see that there are no outliers for the Airbags variable.

New and Turbo are also numeric variables, but they were derived from Mileage and Engine Volume and actually represent logical (0/1) values.

Outliers in Fuel index

QI = quantile(data$Fuel.idx, 0.25)
QS = quantile(data$Fuel.idx, 0.75)
IQR = QS-QI

sum(data$Fuel.idx < QI - 1.5*IQR | data$Fuel.idx > QS + 1.5*IQR)
## [1] 1
ggplot(data = data, mapping = aes(y = Fuel.idx)) +
  geom_boxplot(fill = "plum2", outlier.color = "red", outlier.shape = 4) + 
  ggtitle(expression(atop("Boxplot of Fuel idx", atop(italic("In Red Outliers"))))) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

We have one outlier, which can safely be removed from the data.

outlier = which(data$Fuel.idx < QI - 1.5*IQR | data$Fuel.idx > QS + 1.5*IQR)

data = data[-outlier, ]
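Since the same Tukey/IQR rule was applied to several variables above, it can also be factored into a helper. This is an illustrative refactor on toy data, not the code used in the analysis; the function name is made up.

```r
# Indices of observations outside [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 by default).
iqr_outliers <- function(x, k = 1.5) {
  qs  <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[2] - qs[1]
  which(x < qs[1] - k * iqr | x > qs[2] + k * iqr)
}

# Toy example: an evenly spaced sequence plus one clear outlier.
x <- c(seq(0, 1, by = 0.01), 10)
iqr_outliers(x)   # flags only the value 10, at position 102
```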

Visualization

In this part we consider general plots to get an insight of the data set. We also try to start looking at relations with our target variables.

plot_ly(data, labels = ~Color, type = 'pie') %>%
  layout(title = 'Color of Cars',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

This plot shows the proportions of the different colors among the cars being sold. We see that black, white and silver are the three most common colors on sale. I found it interesting that 6.93% of the cars are purple, a color I would usually consider unusual for a car.

ggplot(data = data) +
  geom_bar(mapping = aes(x = Category, fill = Fuel.type))

In this graph we can appreciate the number of cars on sale in each category. For each category, the bar is divided according to the proportion of each fuel type used within that category. Furthermore, the most common category is Sedan, which makes sense, as Sedan refers to ordinary cars in general (defined as a 4-door passenger car with a trunk that is separate from the passengers). Among all cars, petrol seems to be the most common fuel type, followed by diesel and hybrid.

ggplot(data = data, aes(x = reorder(Manufacturer, Price),y = Price)) + 
  geom_bar(stat ='identity',aes(fill = Price))+
  coord_flip() + 
  theme_grey() + 
  scale_fill_gradient(name="Price Level")+
  labs(title = 'Ranking of Manufacturers by Price',
       y='Manufacturer', x='Price')+ 
  geom_hline(yintercept = mean(data$Price),size = 1, color = 'blue')

Not many cars have a very high price, but from the plot we see that “Hyundai” and “Toyota” are the two manufacturers with the highest accumulated prices in this data set (note that stat = 'identity' sums the prices of all the cars of each manufacturer, so manufacturers with many listings rank higher). Many other manufacturers accumulate much lower totals.

ggplot(data, aes(Engine.volume)) + geom_density(aes(fill=factor(Cylinders)), alpha=0.8) + 
  labs(title="Density plot", 
       subtitle="Engine volume grouped by number of cylinders",
       x="Engine Volume",
       fill="Cylinders")

At first, one may think that the engine volume should depend on the number of cylinders of a car. Cars with a higher number of cylinders, which are the most powerful ones, always try to optimize the volume of the engine so that more cylinders fit in a reduced space, which also means a reduction in weight (but usually an increase in price).

We will later see whether this idea holds. Hypothetically, the cars with more cylinders in a small engine (low engine volume) will cost more than the others, sports cars for instance.

ggplot(data, aes(Gear.box.type, Prod.year)) + geom_boxplot(aes(fill=factor(Pollution))) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title = "Box plot", 
       subtitle = "Production year grouped by Type of transmission",
       x = "Type of transmission",
       y=  "Production Year", 
       fill = "Pollution")

From this plot we see that automatic cars are the most recent type of cars and manual cars the oldest.

Also, the cars with higher pollution were produced approximately between 2000 and 2010, and tiptronic cars are the ones that contribute most to pollution. We can see how, for each transmission type, the more pollutant cars were produced earlier than the less pollutant ones, which makes sense, as over the years governments have been adopting measures to curb pollution.

ggplot(data, aes(x = Airbags)) +
  geom_histogram(fill="darkseagreen2", col="darkslateblue", binwidth = 2)

The histogram above shows how many cars have each number of airbags. As can be observed, most vehicles have around 4 airbags; there are also many vehicles with around 12.

ggplot(data, aes(x = Engine.volume, y = Levy, col = Cylinders)) + 
  geom_point(size = 2, shape = 1, alpha = 0.4) + 
  labs(x = "Engine Volume",y = "Levy", color = "Cylinders")

When comparing the influence of the engine volume on the levy, we notice a positive tendency: as the engine volume increases, so does the levy paid for a car (insurance). This is probably because as the size of the engine increases, the car is better (and more expensive), so insurance companies usually require a greater amount to be paid.

ggplot(data) + geom_point(aes(x = Price, y = Prod.year, color = Gear.box.type), 
                          size=1)

Indeed, most of the cars are automatic, mainly from 2005 to 2020. In the early 2000s there were more manual cars and prices were lower. As the production year increases, the price increases as well, although there is still a wide spread of prices, as not all cars are new or in the best condition.

ggplot(data, aes(x=Category, y= Gear.box.type, color = Fuel.type)) + 
  geom_jitter(size = 2, width = 0.1, alpha = 0.4) + 
  labs(x = "Category",y = "Type of transmission", color = "Type of Fuel")

Regarding the different transmission types in this data set, the plot shows how each category of cars uses a different fuel type. The three most frequent colors correspond to petrol, hybrid and diesel. Diesel appears mainly in manual cars, mostly in “Goods wagon” and “Microbus”. Most cars use petrol or are hybrids; petrol is mainly found in manual and tiptronic cars, with “Sedan” and “Jeep” having the greatest number of petrol cars. Hybrid cars mostly have automatic or variator transmission, and are mainly found in “Hatchback” or “Sedan”.

Let us make some plots taking into account the important variables that we are considering for the project (Pollution and Price)

ggplot(data = data) +
  geom_smooth(mapping = aes(x = Prod.year, y = Price), color = 'red')

In this plot we can see that, as the production year increases, the price of a car increases as well, probably because of currency changes over the years. However, from approximately 2017 the price starts decreasing. This may be due, for instance, to the new renting options that have appeared, reducing the number of cars sold and pushing sellers to lower their prices in order to still make a profit.

ggplot(data, aes(x=as.factor(Drive.wheels), y=Price)) + geom_boxplot(fill="cornflowerblue") +
  ggtitle("Total Price vs Drive wheels")

With this plot we observe that the cars being sold are usually 4x4, and that the median price of the three drive-wheel types is quite similar. We can also notice that there are fewer “Front”-wheel cars than the other two types (4x4 being the most numerous). In general, the price distribution is very similar across the three types; the most noticeable difference is the number of cars of each type.

Classification

Let’s start with the classification part, where we will try to predict a target class: Pollution. The algorithm needs to identify which class each observation belongs to.

Preparing data

data2 = data # copy of the data where the variable types will be changed
# first character to factor, then factor to numeric

cat_cols = c("Manufacturer", "Model", "Category", "Leather.interior",
             "Fuel.type", "Gear.box.type", "Drive.wheels", "Doors",
             "Wheel", "Color", "Pollution")
data2[cat_cols] = lapply(data2[cat_cols], function(x) as.numeric(as.factor(x)))

We converted the character variables to factors and then to numeric codes. In particular, the Pollution variable, which, as mentioned before, represents how pollutant a car is (“HIGH”, “MEDIUM” or “LOW”), is now recoded back to those labels. We can see the number of cars that belong to each category.

data2$Pollution[data2$Pollution == 2] <- 'LOW'
data2$Pollution[data2$Pollution == 1] <- 'HIGH'
data2$Pollution[data2$Pollution == 3] <- 'MEDIUM'
data2$Pollution=as.factor(data2$Pollution)
levels(data2$Pollution)
## [1] "HIGH"   "LOW"    "MEDIUM"
table(data2$Pollution)
## 
##   HIGH    LOW MEDIUM 
##   2472   4494   9098
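The numeric codes 1/2/3 used in the recode come from the alphabetical level order produced by as.factor() (HIGH = 1, LOW = 2, MEDIUM = 3). A minimal sketch of the same recode on toy codes (illustrative only; the variable names are made up):

```r
labs  <- c("HIGH", "LOW", "MEDIUM")   # alphabetical order, as as.factor() uses
codes <- c(2, 1, 3, 3, 2)             # toy numeric codes like data2$Pollution
pollution <- factor(labs[codes], levels = labs)
table(pollution)                      # counts per label, in level order
```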

Now we create the data partition.

# a seed could be fixed here (set.seed) to make the partition reproducible
in_train = createDataPartition(data2$Pollution, p = 0.8, list = FALSE)  
train = data2[in_train,]
test = data2[-in_train,]
nrow(train)
## [1] 12853
nrow(test)
## [1] 3211
table(train$Pollution)/length(train$Pollution)
## 
##      HIGH       LOW    MEDIUM 
## 0.1538940 0.2797790 0.5663269
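createDataPartition() performs a stratified split, which is why the class proportions in the training set match those of the full data. The same idea can be written in base R; this sketch uses the built-in iris data rather than the cars data (illustrative only):

```r
# Sample 80% of the rows within each class separately (stratified split).
set.seed(42)
strat_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(i) sample(i, floor(0.8 * length(i)))))
table(iris$Species[strat_idx])   # exactly 40 observations of each class
```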

Correlation matrix: the correlations between all pairs of variables are displayed, making it easier to identify and visualize patterns.

# correlation 
ggcorr(train[,-21], label = T)

The variables are not highly correlated in this data set, so the accuracies of the models may not be that high.

Bayes Classifiers

Benchmark

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, 
                     verboseIter=T)

# We have many predictors, hence use penalized logistic regression
lrFit <- train(Pollution ~ ., 
               method = "glmnet",
               tuneGrid = expand.grid(alpha = seq(0, 1, 0.1), lambda = seq(0, .1, 0.02)),
               data = train,
               preProcess = c("center", "scale"),
               trControl = ctrl)
## + Fold1: alpha=0.0, lambda=0.1 
## - Fold1: alpha=0.0, lambda=0.1 
## ... (analogous messages for the remaining alpha values and folds omitted)
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0 on full training set
# print(lrFit)
lrPred = predict(lrFit, test)
confusionMatrix(lrPred, test$Pollution)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH  LOW MEDIUM
##     HIGH    492    0      0
##     LOW       0  893      7
##     MEDIUM    2    5   1812
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9956          
##                  95% CI : (0.9927, 0.9976)
##     No Information Rate : 0.5665          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9924          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: HIGH Class: LOW Class: MEDIUM
## Sensitivity               0.9960     0.9944        0.9962
## Specificity               1.0000     0.9970        0.9950
## Pos Pred Value            1.0000     0.9922        0.9962
## Neg Pred Value            0.9993     0.9978        0.9950
## Prevalence                0.1538     0.2797        0.5665
## Detection Rate            0.1532     0.2781        0.5643
## Detection Prevalence      0.1532     0.2803        0.5665
## Balanced Accuracy         0.9980     0.9957        0.9956
table(train$Pollution)
## 
##   HIGH    LOW MEDIUM 
##   1978   3596   7279
maximum=max(table(test$Pollution))
accuracy=maximum/nrow(test)
accuracy
## [1] 0.5664902

This last number is the benchmark accuracy: always predicting the majority class (MEDIUM) gives 56.6%. This benchmark is too simple to keep as a model, but it sets the bar: we should consider as good accuracies only those higher than this one, and the penalized logistic regression above already clears it by a wide margin.

LDA

lda.model= lda(Pollution ~ ., data=train, prior=c(0.15,0.28, 0.57))
lda.model
## Call:
## lda(Pollution ~ ., data = train, prior = c(0.15, 0.28, 0.57))
## 
## Prior probabilities of groups:
##   HIGH    LOW MEDIUM 
##   0.15   0.28   0.57 
## 
## Group means:
##           Price     Levy Manufacturer    Model Prod.year Category
## HIGH   13768.90 777.7204     25.84075 634.4075  2010.113 7.491405
## LOW    10774.04 578.9163     35.92047 708.2839  2012.139 7.353448
## MEDIUM 16830.01 617.4182     28.30870 632.8404  2011.770 7.360214
##        Leather.interior Fuel.type Engine.volume   Mileage Cylinders
## HIGH           1.879171  3.791709      3.052932  753029.7  5.931244
## LOW            1.701335  3.178254      2.033259  463084.2  4.174360
## MEDIUM         1.707515  4.111279      1.953716 1100495.2  4.038879
##        Gear.box.type Drive.wheels    Doors    Wheel    Color  Airbags
## HIGH        1.665319     1.824065 2.921638 1.038423 7.443377 8.458544
## LOW         1.465517     1.905451 2.989155 1.089544 9.695495 7.117631
## MEDIUM      1.503778     1.952466 2.953565 1.087512 9.053579 6.145899
##               New      Turbo  Fuel.idx
## HIGH   0.03791709 0.14307381 0.9757179
## LOW    0.03670745 0.00945495 0.3620189
## MEDIUM 0.02623987 0.12721528 0.9617338
## 
## Coefficients of linear discriminants:
##                            LD1           LD2
## Price            -2.046378e-06  1.461952e-05
## Levy              3.038647e-04  1.698808e-04
## Manufacturer     -2.426120e-03  2.563873e-03
## Model             6.036631e-04 -9.761718e-05
## Prod.year         1.152644e-02 -2.596967e-02
## Category          5.550186e-02  8.151887e-03
## Leather.interior  9.386334e-02  5.855300e-02
## Fuel.type        -8.498417e-02  2.079135e-02
## Engine.volume    -4.336888e-01 -9.498375e-01
## Mileage          -2.112005e-10  8.582862e-10
## Cylinders        -8.433674e-01 -1.331869e+00
## Gear.box.type    -4.398902e-02 -8.278581e-02
## Drive.wheels     -6.064327e-02 -2.495884e-01
## Doors            -1.238897e-01  5.057412e-03
## Wheel             1.236740e-01 -1.216467e-01
## Color             1.282027e-02  6.087599e-03
## Airbags          -1.905964e-02  2.692227e-02
## New              -1.212960e-01  1.617713e-01
## Turbo             5.403148e-02 -8.818177e-02
## Fuel.idx         -1.182585e+01  1.671821e+00
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8896 0.1104

The prior probability of the medium class is greater than those of the high and low classes. We see two linear discriminants because there are 3 groups (at most K − 1 discriminants for K classes).

prediction = predict(lda.model, newdata=test)$class
head(prediction)
## [1] MEDIUM MEDIUM HIGH   LOW    MEDIUM LOW   
## Levels: HIGH LOW MEDIUM
# performance by confusion matrix
# predictions in rows, true values in columns

confusionMatrix(prediction, test$Pollution)$table
##           Reference
## Prediction HIGH  LOW MEDIUM
##     HIGH    460    0     15
##     LOW       0  874      6
##     MEDIUM   34   24   1798
confusionMatrix(prediction, test$Pollution)$overall[1]
##  Accuracy 
## 0.9753971

Indeed, the accuracy is nearly 97.5%, far above the majority-class benchmark (though slightly below the penalized logistic regression). This result is probably due to the fact that the target variable was created in the pre-processing part of the project as a combination of other variables, considering the contribution of each car to CO2 emissions.

LDA is using means and variances of each class in order to create a linear boundary (or separation) between them, which will be delimited by the coefficients. The first thing we see are the Prior probabilities of groups, which are the probabilities that already exist in the training data. Then we see the group means, which are the average predictor within each class. Finally, the coefficients of the linear discriminants (LD1 and LD2) are displayed. They provide the highest possible discrimination among various classes to find the linear combination of the features, which is what will separate them into classes of objects with best performance.
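The same lda() workflow can be reproduced on the built-in iris data, which makes the structure of the output (priors, group means, discriminant coefficients) easy to inspect. This is an illustrative sketch; iris simply stands in for the cars data.

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)   # 3 classes -> 2 linear discriminants
fit$prior                              # prior probabilities of the groups
fit$scaling                            # coefficients of LD1 and LD2
pred <- predict(fit, iris)$class
mean(pred == iris$Species)             # resubstitution accuracy
```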

model <- lda(Pollution ~ ., data=train, prior=c(0.15,0.28, 0.57))
probability = predict(model, test)$posterior

roc.lda <- roc(test$Pollution,probability[,3])
auc(roc.lda) 
## Area under the curve: 0.8388
plot.roc(test$Pollution, probability[,3],col="deeppink", print.auc = TRUE,  auc.polygon=TRUE, grid=c(0.1, 0.2),
         grid.col=c("green", "red"), max.auc.polygon=TRUE,
         auc.polygon.col="lightblue", print.thres=TRUE)

After plotting the ROC curve, we see an illustration of the sensitivity and specificity at each cut point of the test. The AUC (Area Under the Curve) is relatively high, so this is a good model for prediction on our data set.

QDA

qda.model = qda(Pollution ~ ., data = train, prior=c(0.15,0.28, 0.57))
qda.model
## Call:
## qda(Pollution ~ ., data = train, prior = c(0.15, 0.28, 0.57))
## 
## Prior probabilities of groups:
##   HIGH    LOW MEDIUM 
##   0.15   0.28   0.57 
## 
## Group means:
##           Price     Levy Manufacturer    Model Prod.year Category
## HIGH   13768.90 777.7204     25.84075 634.4075  2010.113 7.491405
## LOW    10774.04 578.9163     35.92047 708.2839  2012.139 7.353448
## MEDIUM 16830.01 617.4182     28.30870 632.8404  2011.770 7.360214
##        Leather.interior Fuel.type Engine.volume   Mileage Cylinders
## HIGH           1.879171  3.791709      3.052932  753029.7  5.931244
## LOW            1.701335  3.178254      2.033259  463084.2  4.174360
## MEDIUM         1.707515  4.111279      1.953716 1100495.2  4.038879
##        Gear.box.type Drive.wheels    Doors    Wheel    Color  Airbags
## HIGH        1.665319     1.824065 2.921638 1.038423 7.443377 8.458544
## LOW         1.465517     1.905451 2.989155 1.089544 9.695495 7.117631
## MEDIUM      1.503778     1.952466 2.953565 1.087512 9.053579 6.145899
##               New      Turbo  Fuel.idx
## HIGH   0.03791709 0.14307381 0.9757179
## LOW    0.03670745 0.00945495 0.3620189
## MEDIUM 0.02623987 0.12721528 0.9617338
prediction = predict(qda.model, newdata=test)$class
head(prediction)
## [1] MEDIUM MEDIUM HIGH   LOW    MEDIUM LOW   
## Levels: HIGH LOW MEDIUM
confusionMatrix(prediction, test$Pollution)$table
##           Reference
## Prediction HIGH  LOW MEDIUM
##     HIGH    483    6     35
##     LOW       0  887     30
##     MEDIUM   11    5   1754
confusionMatrix(prediction, test$Pollution)$overall[1]
##  Accuracy 
## 0.9729056
model <- qda(Pollution ~ ., data=train, prior=c(0.15,0.28, 0.57))

We can clearly see that we reach more or less the same conclusions as with the previous analysis, but with slightly lower accuracy. As in LDA, the observations of each class are assumed to be drawn from a normal distribution. The main difference is that QDA assumes each class has its own covariance matrix, which means an increase in the number of parameters.
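The difference in the number of covariance parameters can be made concrete. With p predictors and K classes, LDA estimates one shared p × p covariance matrix while QDA estimates one per class; here p = 20 predictors and K = 3 classes (illustrative arithmetic, not output of the analysis above):

```r
p <- 20; K <- 3
lda_cov <- p * (p + 1) / 2       # parameters of one shared covariance matrix
qda_cov <- K * p * (p + 1) / 2   # one covariance matrix per class
c(lda = lda_cov, qda = qda_cov)  # 210 vs 630 covariance parameters
```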

probability = predict(model, test)$posterior

roc.lda <- roc(test$Pollution,probability[,3])
auc(roc.lda) 
## Area under the curve: 0.8086
plot.roc(test$Pollution, probability[,3],col="deeppink", print.auc = TRUE,  auc.polygon=TRUE, grid=c(0.1, 0.2),
         grid.col=c("green", "red"), max.auc.polygon=TRUE,
         auc.polygon.col="lightblue", print.thres=TRUE)

The AUC is a bit lower here, so there is no huge difference between the QDA and LDA results.

Decision Trees

# Hyper-parameters
control = rpart.control(minsplit = 30, maxdepth = 10, cp=0.01)

# minsplit: minimum number of observations in a node before a split is attempted
# maxdepth: maximum depth of any node of the final tree
# cp: degree of complexity, the smaller the more branches
model = Pollution ~.
dtFit <- rpart(model, data=train, method = "class", control = control)
summary(dtFit)
## Call:
## rpart(formula = model, data = train, method = "class", control = control)
##   n= 12853 
## 
##           CP nsplit  rel error     xerror        xstd
## 1 0.60064586      0 1.00000000 1.00000000 0.010079758
## 2 0.33548619      1 0.39935414 0.39935414 0.007696596
## 3 0.01166128      2 0.06386796 0.06386796 0.003337787
## 4 0.01000000      4 0.04054539 0.03946896 0.002638125
## 
## Variable importance
##      Fuel.idx     Cylinders Engine.volume     Fuel.type  Manufacturer 
##            31            24            18             8             7 
##       Mileage          Levy         Model  Drive.wheels Gear.box.type 
##             3             2             2             2             2 
##     Prod.year 
##             1 
## 
## Node number 1: 12853 observations,    complexity param=0.6006459
##   predicted class=MEDIUM  expected loss=0.4336731  P(node) =1
##     class counts:  1978  3596  7279
##    probabilities: 0.154 0.280 0.566 
##   left son=2 (3724 obs) right son=3 (9129 obs)
##   Primary splits:
##       Fuel.idx      < 0.795   to the left,  improve=3869.8470, (0 missing)
##       Cylinders     < 4.5     to the right, improve=1952.6340, (0 missing)
##       Engine.volume < 2.55    to the right, improve=1609.0710, (0 missing)
##       Fuel.type     < 4.5     to the left,  improve=1553.7830, (0 missing)
##       Manufacturer  < 50.5    to the right, improve= 498.9054, (0 missing)
##   Surrogate splits:
##       Fuel.type     < 4.5     to the left,  agree=0.777, adj=0.230, (0 split)
##       Manufacturer  < 50.5    to the right, agree=0.738, adj=0.095, (0 split)
##       Mileage       < 310032  to the right, agree=0.735, adj=0.086, (0 split)
##       Gear.box.type < 3.5     to the right, agree=0.723, adj=0.044, (0 split)
## 
## Node number 2: 3724 observations,    complexity param=0.01166128
##   predicted class=LOW     expected loss=0.05075188  P(node) =0.2897378
##     class counts:     2  3535   187
##    probabilities: 0.001 0.949 0.050 
##   left son=4 (3415 obs) right son=5 (309 obs)
##   Primary splits:
##       Prod.year     < 2006.5  to the right, improve=124.66870, (0 missing)
##       Fuel.type     < 2       to the right, improve= 72.22757, (0 missing)
##       Cylinders     < 4.5     to the left,  improve= 70.35871, (0 missing)
##       Fuel.idx      < 0.435   to the left,  improve= 60.80848, (0 missing)
##       Engine.volume < 2.55    to the left,  improve= 56.22645, (0 missing)
##   Surrogate splits:
##       Fuel.type    < 2       to the right, agree=0.963, adj=0.553, (0 split)
##       Drive.wheels < 2.5     to the left,  agree=0.924, adj=0.087, (0 split)
##       Manufacturer < 5.5     to the right, agree=0.923, adj=0.071, (0 split)
##       Model        < 1152    to the left,  agree=0.921, adj=0.049, (0 split)
##       Mileage      < 2564160 to the left,  agree=0.918, adj=0.006, (0 split)
## 
## Node number 3: 9129 observations,    complexity param=0.3354862
##   predicted class=MEDIUM  expected loss=0.2231351  P(node) =0.7102622
##     class counts:  1976    61  7092
##    probabilities: 0.216 0.007 0.777 
##   left son=6 (1876 obs) right son=7 (7253 obs)
##   Primary splits:
##       Cylinders     < 4.5     to the right, improve=2863.0490, (0 missing)
##       Engine.volume < 2.55    to the right, improve=2331.7610, (0 missing)
##       Drive.wheels  < 1.5     to the left,  improve= 486.8362, (0 missing)
##       Levy          < 913     to the right, improve= 484.9521, (0 missing)
##       Manufacturer  < 4.5     to the left,  improve= 421.9788, (0 missing)
##   Surrogate splits:
##       Engine.volume < 2.75    to the right, agree=0.948, adj=0.746, (0 split)
##       Manufacturer  < 4.5     to the left,  agree=0.829, adj=0.169, (0 split)
##       Levy          < 1381    to the right, agree=0.814, adj=0.093, (0 split)
##       Model         < 1172.5  to the right, agree=0.812, adj=0.087, (0 split)
##       Drive.wheels  < 1.5     to the left,  agree=0.810, adj=0.076, (0 split)
## 
## Node number 4: 3415 observations
##   predicted class=LOW     expected loss=0.01171303  P(node) =0.2656967
##     class counts:     1  3375    39
##    probabilities: 0.000 0.988 0.011 
## 
## Node number 5: 309 observations,    complexity param=0.01166128
##   predicted class=LOW     expected loss=0.4822006  P(node) =0.02404108
##     class counts:     1   160   148
##    probabilities: 0.003 0.518 0.479 
##   left son=10 (174 obs) right son=11 (135 obs)
##   Primary splits:
##       Cylinders        < 4.5     to the left,  improve=120.30840, (0 missing)
##       Engine.volume    < 2.45    to the left,  improve=101.43390, (0 missing)
##       Price            < 11552   to the left,  improve= 29.50067, (0 missing)
##       Leather.interior < 1.5     to the left,  improve= 24.67542, (0 missing)
##       Manufacturer     < 37.5    to the right, improve= 22.60961, (0 missing)
##   Surrogate splits:
##       Engine.volume    < 2.45    to the left,  agree=0.926, adj=0.830, (0 split)
##       Leather.interior < 1.5     to the left,  agree=0.715, adj=0.348, (0 split)
##       Price            < 10819.5 to the left,  agree=0.712, adj=0.341, (0 split)
##       Airbags          < 6.5     to the left,  agree=0.683, adj=0.274, (0 split)
##       Gear.box.type    < 2.5     to the left,  agree=0.670, adj=0.244, (0 split)
## 
## Node number 6: 1876 observations
##   predicted class=HIGH    expected loss=0.001599147  P(node) =0.1459581
##     class counts:  1873     0     3
##    probabilities: 0.998 0.000 0.002 
## 
## Node number 7: 7253 observations
##   predicted class=MEDIUM  expected loss=0.02261133  P(node) =0.5643041
##     class counts:   103    61  7089
##    probabilities: 0.014 0.008 0.977 
## 
## Node number 10: 174 observations
##   predicted class=LOW     expected loss=0.09195402  P(node) =0.0135377
##     class counts:     0   158    16
##    probabilities: 0.000 0.908 0.092 
## 
## Node number 11: 135 observations
##   predicted class=MEDIUM  expected loss=0.02222222  P(node) =0.01050338
##     class counts:     1     2   132
##    probabilities: 0.007 0.015 0.978
rpart.plot(dtFit, digits=3)

We start at the root node (top of the graph).

At the top, we can see the overall probability for a car to be pollutant: the proportion of each pollution category, 15 percent HIGH, 28 percent LOW and 56 percent MEDIUM. This node asks whether the fuel index is greater than 0.795 or not, and we go down to the root’s child nodes (depth 2) depending on the answer. Following the graph, we can trace the path determined by the different characteristics of a car until we reach a leaf node, which gives the conclusion on how pollutant the car is.
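Reading a fitted tree works the same way on any data set. A self-contained example on iris with the same hyper-parameters (illustrative only; iris stands in for the cars data):

```r
library(rpart)

ctrl <- rpart.control(minsplit = 30, maxdepth = 10, cp = 0.01)
fit  <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)
fit   # printed tree: each node shows its split rule, n, and class probabilities
mean(predict(fit, iris, type = "class") == iris$Species)   # resubstitution accuracy
```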

control = rpart.control(minsplit = 40, maxdepth = 12, cp=0.001)
dtFit <- rpart(model, data=train, method = "class", control = control)
summary(dtFit)
## Call:
## rpart(formula = model, data = train, method = "class", control = control)
##   n= 12853 
## 
##            CP nsplit   rel error      xerror        xstd
## 1 0.600645856      0 1.000000000 1.000000000 0.010079758
## 2 0.335486186      1 0.399354144 0.399354144 0.007696596
## 3 0.011661285      2 0.063867958 0.063867958 0.003337787
## 4 0.009508432      4 0.040545389 0.043774668 0.002775658
## 5 0.007176175      5 0.031036957 0.031754575 0.002370328
## 6 0.006458558      6 0.023860782 0.028525296 0.002248167
## 7 0.001973448      7 0.017402225 0.018299247 0.001804692
## 8 0.001674441      9 0.013455328 0.013634733 0.001559380
## 9 0.001000000     12 0.008432006 0.008073197 0.001201373
## 
## Variable importance
##      Fuel.idx     Cylinders Engine.volume     Fuel.type  Manufacturer 
##            30            24            19             7             7 
##       Mileage          Levy         Model  Drive.wheels Gear.box.type 
##             3             2             2             2             2 
##     Prod.year         Price 
##             1             1 
## 
## Node number 1: 12853 observations,    complexity param=0.6006459
##   predicted class=MEDIUM  expected loss=0.4336731  P(node) =1
##     class counts:  1978  3596  7279
##    probabilities: 0.154 0.280 0.566 
##   left son=2 (3724 obs) right son=3 (9129 obs)
##   Primary splits:
##       Fuel.idx      < 0.795    to the left,  improve=3869.8470, (0 missing)
##       Cylinders     < 4.5      to the right, improve=1952.6340, (0 missing)
##       Engine.volume < 2.55     to the right, improve=1609.0710, (0 missing)
##       Fuel.type     < 4.5      to the left,  improve=1553.7830, (0 missing)
##       Manufacturer  < 50.5     to the right, improve= 498.9054, (0 missing)
##   Surrogate splits:
##       Fuel.type     < 4.5      to the left,  agree=0.777, adj=0.230, (0 split)
##       Manufacturer  < 50.5     to the right, agree=0.738, adj=0.095, (0 split)
##       Mileage       < 310032   to the right, agree=0.735, adj=0.086, (0 split)
##       Gear.box.type < 3.5      to the right, agree=0.723, adj=0.044, (0 split)
## 
## Node number 2: 3724 observations,    complexity param=0.01166128
##   predicted class=LOW     expected loss=0.05075188  P(node) =0.2897378
##     class counts:     2  3535   187
##    probabilities: 0.001 0.949 0.050 
##   left son=4 (3415 obs) right son=5 (309 obs)
##   Primary splits:
##       Prod.year     < 2006.5   to the right, improve=124.66870, (0 missing)
##       Fuel.type     < 2        to the right, improve= 72.22757, (0 missing)
##       Cylinders     < 4.5      to the left,  improve= 70.35871, (0 missing)
##       Fuel.idx      < 0.435    to the left,  improve= 60.80848, (0 missing)
##       Engine.volume < 2.55     to the left,  improve= 56.22645, (0 missing)
##   Surrogate splits:
##       Fuel.type    < 2        to the right, agree=0.963, adj=0.553, (0 split)
##       Drive.wheels < 2.5      to the left,  agree=0.924, adj=0.087, (0 split)
##       Manufacturer < 5.5      to the right, agree=0.923, adj=0.071, (0 split)
##       Model        < 1152     to the left,  agree=0.921, adj=0.049, (0 split)
##       Mileage      < 2564160  to the left,  agree=0.918, adj=0.006, (0 split)
## 
## Node number 3: 9129 observations,    complexity param=0.3354862
##   predicted class=MEDIUM  expected loss=0.2231351  P(node) =0.7102622
##     class counts:  1976    61  7092
##    probabilities: 0.216 0.007 0.777 
##   left son=6 (1876 obs) right son=7 (7253 obs)
##   Primary splits:
##       Cylinders     < 4.5      to the right, improve=2863.0490, (0 missing)
##       Engine.volume < 2.55     to the right, improve=2331.7610, (0 missing)
##       Drive.wheels  < 1.5      to the left,  improve= 486.8362, (0 missing)
##       Levy          < 913      to the right, improve= 484.9521, (0 missing)
##       Manufacturer  < 4.5      to the left,  improve= 421.9788, (0 missing)
##   Surrogate splits:
##       Engine.volume < 2.75     to the right, agree=0.948, adj=0.746, (0 split)
##       Manufacturer  < 4.5      to the left,  agree=0.829, adj=0.169, (0 split)
##       Levy          < 1381     to the right, agree=0.814, adj=0.093, (0 split)
##       Model         < 1172.5   to the right, agree=0.812, adj=0.087, (0 split)
##       Drive.wheels  < 1.5      to the left,  agree=0.810, adj=0.076, (0 split)
## 
## Node number 4: 3415 observations,    complexity param=0.001674441
##   predicted class=LOW     expected loss=0.01171303  P(node) =0.2656967
##     class counts:     1  3375    39
##    probabilities: 0.000 0.988 0.011 
##   left son=8 (3399 obs) right son=9 (16 obs)
##   Primary splits:
##       Fuel.type < 2        to the right, improve=4.246581, (0 missing)
##       Doors     < 1.5      to the right, improve=4.194169, (0 missing)
##       Fuel.idx  < 0.435    to the left,  improve=3.704599, (0 missing)
##       Price     < 34340.5  to the left,  improve=2.239090, (0 missing)
##       Cylinders < 6.5      to the left,  improve=1.847703, (0 missing)
## 
## Node number 5: 309 observations,    complexity param=0.01166128
##   predicted class=LOW     expected loss=0.4822006  P(node) =0.02404108
##     class counts:     1   160   148
##    probabilities: 0.003 0.518 0.479 
##   left son=10 (174 obs) right son=11 (135 obs)
##   Primary splits:
##       Cylinders        < 4.5      to the left,  improve=120.30840, (0 missing)
##       Engine.volume    < 2.45     to the left,  improve=101.43390, (0 missing)
##       Price            < 11552    to the left,  improve= 29.50067, (0 missing)
##       Leather.interior < 1.5      to the left,  improve= 24.67542, (0 missing)
##       Manufacturer     < 37.5     to the right, improve= 22.60961, (0 missing)
##   Surrogate splits:
##       Engine.volume    < 2.45     to the left,  agree=0.926, adj=0.830, (0 split)
##       Leather.interior < 1.5      to the left,  agree=0.715, adj=0.348, (0 split)
##       Price            < 10819.5  to the left,  agree=0.712, adj=0.341, (0 split)
##       Airbags          < 6.5      to the left,  agree=0.683, adj=0.274, (0 split)
##       Gear.box.type    < 2.5      to the left,  agree=0.670, adj=0.244, (0 split)
## 
## Node number 6: 1876 observations
##   predicted class=HIGH    expected loss=0.001599147  P(node) =0.1459581
##     class counts:  1873     0     3
##    probabilities: 0.998 0.000 0.002 
## 
## Node number 7: 7253 observations,    complexity param=0.009508432
##   predicted class=MEDIUM  expected loss=0.02261133  P(node) =0.5643041
##     class counts:   103    61  7089
##    probabilities: 0.014 0.008 0.977 
##   left son=14 (153 obs) right son=15 (7100 obs)
##   Primary splits:
##       Engine.volume < 2.6      to the right, improve=134.043900, (0 missing)
##       Cylinders     < 3.5      to the left,  improve= 81.845530, (0 missing)
##       Levy          < 1097     to the right, improve=  6.559693, (0 missing)
##       Model         < 1143     to the right, improve=  2.736578, (0 missing)
##       Prod.year     < 2011.5   to the left,  improve=  2.482419, (0 missing)
## 
## Node number 8: 3399 observations,    complexity param=0.001674441
##   predicted class=LOW     expected loss=0.01000294  P(node) =0.2644519
##     class counts:     1  3365    33
##    probabilities: 0.000 0.990 0.010 
##   left son=16 (2837 obs) right son=17 (562 obs)
##   Primary splits:
##       Fuel.idx      < 0.46     to the left,  improve=2.6587290, (0 missing)
##       Fuel.type     < 3.5      to the left,  improve=2.3201490, (0 missing)
##       Cylinders     < 6.5      to the left,  improve=1.8708180, (0 missing)
##       Engine.volume < 2.35     to the left,  improve=0.8907350, (0 missing)
##       Drive.wheels  < 2.5      to the left,  improve=0.8886964, (0 missing)
##   Surrogate splits:
##       Fuel.type     < 3.5      to the left,  agree=0.981, adj=0.886, (0 split)
##       Model         < 999      to the left,  agree=0.889, adj=0.327, (0 split)
##       Mileage       < 437213.5 to the left,  agree=0.869, adj=0.210, (0 split)
##       Engine.volume < 1.05     to the right, agree=0.835, adj=0.004, (0 split)
## 
## Node number 9: 16 observations
##   predicted class=LOW     expected loss=0.375  P(node) =0.001244846
##     class counts:     0    10     6
##    probabilities: 0.000 0.625 0.375 
## 
## Node number 10: 174 observations
##   predicted class=LOW     expected loss=0.09195402  P(node) =0.0135377
##     class counts:     0   158    16
##    probabilities: 0.000 0.908 0.092 
## 
## Node number 11: 135 observations
##   predicted class=MEDIUM  expected loss=0.02222222  P(node) =0.01050338
##     class counts:     1     2   132
##    probabilities: 0.007 0.015 0.978 
## 
## Node number 14: 153 observations,    complexity param=0.007176175
##   predicted class=HIGH    expected loss=0.3267974  P(node) =0.01190384
##     class counts:   103     0    50
##    probabilities: 0.673 0.000 0.327 
##   left son=28 (103 obs) right son=29 (50 obs)
##   Primary splits:
##       Manufacturer  < 50       to the left,  improve=48.80570, (0 missing)
##       Airbags       < 7        to the left,  improve=36.00009, (0 missing)
##       Model         < 1047.5   to the left,  improve=29.02608, (0 missing)
##       Price         < 4312.5   to the right, improve=26.12838, (0 missing)
##       Engine.volume < 2.75     to the right, improve=23.62806, (0 missing)
##   Surrogate splits:
##       Price        < 4312.5   to the right, agree=0.863, adj=0.58, (0 split)
##       Airbags      < 7        to the left,  agree=0.863, adj=0.58, (0 split)
##       Model        < 1047.5   to the left,  agree=0.830, adj=0.48, (0 split)
##       Levy         < 849      to the right, agree=0.797, adj=0.38, (0 split)
##       Drive.wheels < 2.5      to the left,  agree=0.797, adj=0.38, (0 split)
## 
## Node number 15: 7100 observations,    complexity param=0.006458558
##   predicted class=MEDIUM  expected loss=0.008591549  P(node) =0.5524002
##     class counts:     0    61  7039
##    probabilities: 0.000 0.009 0.991 
##   left son=30 (86 obs) right son=31 (7014 obs)
##   Primary splits:
##       Cylinders        < 3.5      to the left,  improve=85.486710, (0 missing)
##       Engine.volume    < 1.05     to the left,  improve=28.243920, (0 missing)
##       Model            < 1143     to the right, improve= 2.163930, (0 missing)
##       Wheel            < 1.5      to the right, improve= 1.390551, (0 missing)
##       Leather.interior < 1.5      to the left,  improve= 1.205498, (0 missing)
##   Surrogate splits:
##       Engine.volume < 0.9      to the left,  agree=0.988, adj=0.023, (0 split)
## 
## Node number 16: 2837 observations
##   predicted class=LOW     expected loss=0.001057455  P(node) =0.2207267
##     class counts:     0  2834     3
##    probabilities: 0.000 0.999 0.001 
## 
## Node number 17: 562 observations,    complexity param=0.001674441
##   predicted class=LOW     expected loss=0.05516014  P(node) =0.0437252
##     class counts:     1   531    30
##    probabilities: 0.002 0.945 0.053 
##   left son=34 (534 obs) right son=35 (28 obs)
##   Primary splits:
##       Engine.volume < 2.2      to the left,  improve=52.71305, (0 missing)
##       Airbags       < 5        to the left,  improve=17.11212, (0 missing)
##       Model         < 821.5    to the right, improve=15.64918, (0 missing)
##       Levy          < 225      to the right, improve=10.96933, (0 missing)
##       Color         < 7        to the right, improve=10.96586, (0 missing)
##   Surrogate splits:
##       Cylinders    < 4.5      to the left,  agree=0.959, adj=0.179, (0 split)
##       Category     < 10.5     to the left,  agree=0.957, adj=0.143, (0 split)
##       Manufacturer < 52.5     to the left,  agree=0.956, adj=0.107, (0 split)
##       Model        < 307      to the right, agree=0.956, adj=0.107, (0 split)
##       Price        < 37500    to the left,  agree=0.954, adj=0.071, (0 split)
## 
## Node number 28: 103 observations
##   predicted class=HIGH    expected loss=0.04854369  P(node) =0.008013693
##     class counts:    98     0     5
##    probabilities: 0.951 0.000 0.049 
## 
## Node number 29: 50 observations
##   predicted class=MEDIUM  expected loss=0.1  P(node) =0.003890142
##     class counts:     5     0    45
##    probabilities: 0.100 0.000 0.900 
## 
## Node number 30: 86 observations,    complexity param=0.001973448
##   predicted class=LOW     expected loss=0.2906977  P(node) =0.006691045
##     class counts:     0    61    25
##    probabilities: 0.000 0.709 0.291 
##   left son=60 (37 obs) right son=61 (49 obs)
##   Primary splits:
##       Engine.volume < 1.1      to the left,  improve=10.975320, (0 missing)
##       Manufacturer  < 48       to the right, improve= 5.306386, (0 missing)
##       Model         < 832.5    to the right, improve= 3.994005, (0 missing)
##       Cylinders     < 2.5      to the left,  improve= 2.797629, (0 missing)
##       Levy          < 643.5    to the left,  improve= 2.526100, (0 missing)
##   Surrogate splits:
##       Model        < 859      to the right, agree=0.791, adj=0.514, (0 split)
##       Manufacturer < 49.5     to the right, agree=0.756, adj=0.432, (0 split)
##       Cylinders    < 2.5      to the right, agree=0.733, adj=0.378, (0 split)
##       Prod.year    < 2014.5   to the right, agree=0.663, adj=0.216, (0 split)
##       Levy         < 422.5    to the right, agree=0.640, adj=0.162, (0 split)
## 
## Node number 31: 7014 observations
##   predicted class=MEDIUM  expected loss=0  P(node) =0.5457092
##     class counts:     0     0  7014
##    probabilities: 0.000 0.000 1.000 
## 
## Node number 34: 534 observations
##   predicted class=LOW     expected loss=0.005617978  P(node) =0.04154672
##     class counts:     1   531     2
##    probabilities: 0.002 0.994 0.004 
## 
## Node number 35: 28 observations
##   predicted class=MEDIUM  expected loss=0  P(node) =0.00217848
##     class counts:     0     0    28
##    probabilities: 0.000 0.000 1.000 
## 
## Node number 60: 37 observations
##   predicted class=LOW     expected loss=0  P(node) =0.002878705
##     class counts:     0    37     0
##    probabilities: 0.000 1.000 0.000 
## 
## Node number 61: 49 observations,    complexity param=0.001973448
##   predicted class=MEDIUM  expected loss=0.4897959  P(node) =0.00381234
##     class counts:     0    24    25
##    probabilities: 0.000 0.490 0.510 
##   left son=122 (27 obs) right son=123 (22 obs)
##   Primary splits:
##       Cylinders     < 2.5      to the left,  improve=19.156460, (0 missing)
##       Engine.volume < 1.25     to the right, improve= 8.675402, (0 missing)
##       Manufacturer  < 29       to the left,  improve= 7.546939, (0 missing)
##       Model         < 737.5    to the left,  improve= 7.432940, (0 missing)
##       Prod.year     < 2011     to the left,  improve= 6.837164, (0 missing)
##   Surrogate splits:
##       Engine.volume < 1.25     to the right, agree=0.837, adj=0.636, (0 split)
##       Prod.year     < 2008.5   to the left,  agree=0.816, adj=0.591, (0 split)
##       Levy          < 253      to the left,  agree=0.796, adj=0.545, (0 split)
##       Manufacturer  < 34       to the left,  agree=0.776, adj=0.500, (0 split)
##       Mileage       < 130500   to the right, agree=0.735, adj=0.409, (0 split)
## 
## Node number 122: 27 observations
##   predicted class=LOW     expected loss=0.1111111  P(node) =0.002100677
##     class counts:     0    24     3
##    probabilities: 0.000 0.889 0.111 
## 
## Node number 123: 22 observations
##   predicted class=MEDIUM  expected loss=0  P(node) =0.001711663
##     class counts:     0     0    22
##    probabilities: 0.000 0.000 1.000
rpart.plot(dtFit, digits=3)

In this tree we have many more nodes: by setting the complexity parameter lower (cp = 0.001) we allowed the tree to grow much deeper. Still, we start at the root node (initial split) and follow the path given by the car's features to obtain the probability of it being a HIGH, MEDIUM or LOW polluter. The branches link the nodes together, and each internal node refers to its splitting variable. The decision tree is a very useful tool for making simple, interpretable predictions: it lets us go beyond the exploratory analysis and compute estimates that a purely descriptive approach cannot provide.
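The importance scores listed in the summary are also stored on the fitted object, so they can be inspected directly:

```r
# Importance accumulated over primary and surrogate splits,
# expressed as shares of the total (matches the summary's percentages)
round(100 * dtFit$variable.importance / sum(dtFit$variable.importance))
```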

Prediction

dtPred <- predict(dtFit, test, type = "class") # default class predictions
dtProb <- predict(dtFit, test, type = "prob")  # class probabilities
# Override the default majority rule: predict LOW or MEDIUM whenever
# that class's probability exceeds the threshold, otherwise HIGH
threshold = 0.2
dtPred = rep("HIGH", nrow(test))
dtPred[which(dtProb[,2] > threshold)] = "LOW"
dtPred[which(dtProb[,3] > threshold)] = "MEDIUM"
confusionMatrix(factor(dtPred), test$Pollution)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH  LOW MEDIUM
##     HIGH    494    0      2
##     LOW       0  896      5
##     MEDIUM    0    2   1812
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9972          
##                  95% CI : (0.9947, 0.9987)
##     No Information Rate : 0.5665          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9951          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: HIGH Class: LOW Class: MEDIUM
## Sensitivity               1.0000     0.9978        0.9962
## Specificity               0.9993     0.9978        0.9986
## Pos Pred Value            0.9960     0.9945        0.9989
## Neg Pred Value            1.0000     0.9991        0.9950
## Prevalence                0.1538     0.2797        0.5665
## Detection Rate            0.1538     0.2790        0.5643
## Detection Prevalence      0.1545     0.2806        0.5649
## Balanced Accuracy         0.9996     0.9978        0.9974

Regarding accuracy, we obtain a value (99.7%) very similar to the previous models', which we have seen is high for this data set. With the confusion matrix we can evaluate the classification performance in more detail: each cell counts how many observations of a given true class were assigned to each predicted class, so the off-diagonal entries show exactly where the classifier confuses one class with another.
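As a sanity check, the by-class sensitivities and the overall accuracy reported above can be recomputed by hand from the confusion-matrix counts:

```r
CM <- confusionMatrix(factor(dtPred), test$Pollution)$table
# Sensitivity per class: correct predictions / all truly in that class
diag(CM) / colSums(CM)
# Overall accuracy: correct predictions / all predictions
sum(diag(CM)) / sum(CM)
```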

With Caret

caret.fit <- train(model, 
                   data = train, 
                   method = "rpart",
                   control=rpart.control(minsplit = 40, maxdepth = 12),
                   trControl = trainControl(method = "cv", number = 5),
                   tuneLength=10)
caret.fit
## CART 
## 
## 12853 samples
##    20 predictor
##     3 classes: 'HIGH', 'LOW', 'MEDIUM' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10282, 10282, 10282, 10282, 10284 
## Resampling results across tuning parameters:
## 
##   cp            Accuracy   Kappa    
##   0.0003588088  0.9956426  0.9924582
##   0.0011661285  0.9958762  0.9928679
##   0.0016744409  0.9947088  0.9908569
##   0.0019734482  0.9939309  0.9895153
##   0.0064585576  0.9898848  0.9824924
##   0.0071761751  0.9877839  0.9788796
##   0.0095084320  0.9853724  0.9746656
##   0.0116612845  0.9760353  0.9584628
##   0.3354861859  0.8845307  0.7784968
##   0.6006458558  0.7202366  0.3944380
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001166128.
rpart.plot(caret.fit$finalModel)

The same fit can be obtained with caret. As set in the trControl argument, resampling was done by 5-fold cross-validation. Across the tuning grid the accuracy ranges from about 72% to 99.6%, and the model selected was, of course, the one with the largest accuracy, corresponding to a complexity parameter (cp) of 0.001166128. The accuracy is still very high.
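The tuning results can also be inspected graphically: plotting the fitted train object shows the cross-validated accuracy against the complexity parameter, and bestTune stores the selected value:

```r
plot(caret.fit)      # CV accuracy as a function of cp
caret.fit$bestTune   # the selected complexity parameter
```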

dtProb <- predict(caret.fit, test, type = "prob")
threshold = 0.2
dtPred = rep("HIGH", nrow(test))
dtPred[which(dtProb[,2] > threshold)] = "LOW"
dtPred[which(dtProb[,3] > threshold)] = "MEDIUM"
confusionMatrix(factor(dtPred), test$Pollution)$table
##           Reference
## Prediction HIGH  LOW MEDIUM
##     HIGH    494    0      2
##     LOW       0  896      5
##     MEDIUM    0    2   1812
CM = confusionMatrix(factor(dtPred), test$Pollution)$table

Random Forest

rf.train <- randomForest(Pollution ~., data=train,ntree=200
                         ,mtry=10,importance=TRUE, do.trace=T, cutoff=c(0.3, 0.2,0.3))
## ntree      OOB      1      2      3
##     1:   0.45%  0.27%  0.31%  0.56%
##     2:   0.42%  0.41%  0.24%  0.51%
##     3:   0.30%  0.33%  0.11%  0.39%
##     4:   0.35%  0.42%  0.20%  0.41%
##     5:   0.33%  0.34%  0.18%  0.40%
##     6:   0.37%  0.38%  0.21%  0.46%
##     7:   0.32%  0.37%  0.06%  0.43%
##     8:   0.26%  0.21%  0.06%  0.37%
##     9:   0.30%  0.26%  0.08%  0.42%
##    10:   0.28%  0.20%  0.08%  0.40%
##    11:   0.25%  0.20%  0.06%  0.36%
##    12:   0.23%  0.15%  0.03%  0.36%
##    13:   0.23%  0.15%  0.03%  0.34%
##    14:   0.20%  0.15%  0.03%  0.30%
##    15:   0.22%  0.15%  0.03%  0.33%
##    16:   0.19%  0.15%  0.03%  0.29%
##    17:   0.19%  0.15%  0.03%  0.27%
##    18:   0.19%  0.20%  0.03%  0.27%
##    19:   0.19%  0.20%  0.03%  0.27%
##    20:   0.20%  0.15%  0.03%  0.30%
##    21:   0.19%  0.15%  0.03%  0.27%
##    22:   0.19%  0.10%  0.03%  0.29%
##    23:   0.18%  0.10%  0.03%  0.27%
##    24:   0.19%  0.10%  0.03%  0.29%
##    25:   0.18%  0.10%  0.03%  0.27%
##    26:   0.18%  0.15%  0.03%  0.26%
##    27:   0.19%  0.15%  0.03%  0.29%
##    28:   0.19%  0.15%  0.03%  0.27%
##    29:   0.18%  0.15%  0.03%  0.26%
##    30:   0.17%  0.10%  0.03%  0.26%
##    31:   0.16%  0.10%  0.03%  0.25%
##    32:   0.16%  0.10%  0.03%  0.25%
##    33:   0.16%  0.15%  0.03%  0.23%
##    34:   0.15%  0.10%  0.03%  0.22%
##    35:   0.14%  0.10%  0.03%  0.21%
##    36:   0.14%  0.10%  0.03%  0.21%
##    37:   0.13%  0.10%  0.03%  0.19%
##    38:   0.16%  0.10%  0.03%  0.25%
##    39:   0.16%  0.10%  0.03%  0.23%
##    40:   0.16%  0.10%  0.03%  0.23%
##    41:   0.15%  0.10%  0.03%  0.22%
##    42:   0.17%  0.15%  0.03%  0.25%
##    43:   0.17%  0.15%  0.03%  0.25%
##    44:   0.16%  0.10%  0.03%  0.25%
##    45:   0.17%  0.10%  0.03%  0.26%
##    46:   0.17%  0.10%  0.03%  0.26%
##    47:   0.17%  0.10%  0.03%  0.26%
##    48:   0.18%  0.15%  0.03%  0.26%
##    49:   0.18%  0.15%  0.03%  0.26%
##    50:   0.16%  0.10%  0.03%  0.25%
##    51:   0.16%  0.10%  0.03%  0.25%
##    52:   0.17%  0.15%  0.03%  0.25%
##    53:   0.17%  0.10%  0.03%  0.26%
##    54:   0.16%  0.10%  0.03%  0.25%
##    55:   0.17%  0.10%  0.03%  0.26%
##    56:   0.16%  0.10%  0.03%  0.25%
##    57:   0.16%  0.15%  0.03%  0.23%
##    58:   0.15%  0.10%  0.03%  0.22%
##    59:   0.16%  0.10%  0.03%  0.23%
##    60:   0.16%  0.10%  0.03%  0.23%
##    61:   0.16%  0.10%  0.03%  0.25%
##    62:   0.16%  0.10%  0.03%  0.23%
##    63:   0.15%  0.10%  0.03%  0.22%
##    64:   0.15%  0.10%  0.03%  0.22%
##    65:   0.15%  0.10%  0.03%  0.22%
##    66:   0.15%  0.10%  0.03%  0.22%
##    67:   0.16%  0.15%  0.03%  0.23%
##    68:   0.16%  0.15%  0.03%  0.23%
##    69:   0.15%  0.10%  0.03%  0.22%
##    70:   0.15%  0.10%  0.03%  0.22%
##    71:   0.15%  0.10%  0.03%  0.22%
##    72:   0.16%  0.10%  0.03%  0.23%
##    73:   0.14%  0.10%  0.03%  0.21%
##    74:   0.16%  0.10%  0.03%  0.23%
##    75:   0.15%  0.10%  0.03%  0.22%
##    76:   0.14%  0.10%  0.03%  0.21%
##    77:   0.14%  0.10%  0.03%  0.21%
##    78:   0.14%  0.10%  0.03%  0.21%
##    79:   0.14%  0.10%  0.03%  0.21%
##    80:   0.14%  0.10%  0.03%  0.21%
##    81:   0.14%  0.10%  0.03%  0.21%
##    82:   0.15%  0.10%  0.03%  0.22%
##    83:   0.15%  0.10%  0.03%  0.22%
##    84:   0.15%  0.10%  0.03%  0.22%
##    85:   0.14%  0.10%  0.03%  0.21%
##    86:   0.15%  0.10%  0.03%  0.22%
##    87:   0.15%  0.10%  0.03%  0.22%
##    88:   0.15%  0.10%  0.03%  0.22%
##    89:   0.15%  0.10%  0.03%  0.22%
##    90:   0.15%  0.10%  0.03%  0.22%
##    91:   0.16%  0.10%  0.03%  0.23%
##    92:   0.15%  0.10%  0.03%  0.22%
##    93:   0.15%  0.10%  0.03%  0.22%
##    94:   0.16%  0.10%  0.03%  0.23%
##    95:   0.15%  0.10%  0.03%  0.22%
##    96:   0.15%  0.10%  0.03%  0.22%
##    97:   0.14%  0.10%  0.03%  0.21%
##    98:   0.14%  0.10%  0.03%  0.21%
##    99:   0.14%  0.10%  0.03%  0.21%
##   100:   0.14%  0.10%  0.03%  0.21%
##   101:   0.15%  0.10%  0.03%  0.22%
##   102:   0.15%  0.10%  0.03%  0.22%
##   103:   0.15%  0.10%  0.03%  0.22%
##   104:   0.16%  0.10%  0.03%  0.23%
##   105:   0.16%  0.10%  0.03%  0.23%
##   106:   0.16%  0.10%  0.03%  0.23%
##   107:   0.16%  0.10%  0.03%  0.23%
##   108:   0.16%  0.10%  0.03%  0.25%
##   109:   0.16%  0.10%  0.03%  0.25%
##   110:   0.16%  0.10%  0.03%  0.25%
##   111:   0.16%  0.10%  0.03%  0.25%
##   112:   0.17%  0.15%  0.03%  0.25%
##   113:   0.17%  0.15%  0.03%  0.25%
##   114:   0.17%  0.15%  0.03%  0.25%
##   115:   0.16%  0.10%  0.03%  0.25%
##   116:   0.16%  0.10%  0.03%  0.25%
##   117:   0.17%  0.15%  0.03%  0.25%
##   118:   0.16%  0.15%  0.03%  0.23%
##   119:   0.16%  0.15%  0.03%  0.23%
##   120:   0.16%  0.10%  0.03%  0.23%
##   121:   0.16%  0.10%  0.03%  0.23%
##   122:   0.16%  0.10%  0.03%  0.23%
##   123:   0.16%  0.15%  0.03%  0.23%
##   124:   0.15%  0.10%  0.03%  0.22%
##   125:   0.15%  0.10%  0.03%  0.22%
##   126:   0.15%  0.10%  0.03%  0.22%
##   127:   0.15%  0.10%  0.03%  0.22%
##   128:   0.15%  0.10%  0.03%  0.22%
##   129:   0.14%  0.10%  0.03%  0.21%
##   130:   0.15%  0.15%  0.03%  0.21%
##   131:   0.14%  0.10%  0.03%  0.21%
##   132:   0.15%  0.15%  0.03%  0.21%
##   133:   0.15%  0.15%  0.03%  0.21%
##   134:   0.15%  0.15%  0.03%  0.21%
##   135:   0.16%  0.15%  0.03%  0.22%
##   136:   0.16%  0.15%  0.03%  0.22%
##   137:   0.16%  0.15%  0.03%  0.22%
##   138:   0.16%  0.15%  0.03%  0.22%
##   139:   0.16%  0.15%  0.03%  0.22%
##   140:   0.16%  0.15%  0.03%  0.22%
##   141:   0.15%  0.15%  0.03%  0.21%
##   142:   0.15%  0.15%  0.03%  0.21%
##   143:   0.15%  0.15%  0.03%  0.21%
##   144:   0.16%  0.15%  0.03%  0.22%
##   145:   0.15%  0.10%  0.03%  0.22%
##   146:   0.15%  0.10%  0.03%  0.22%
##   147:   0.15%  0.10%  0.03%  0.22%
##   148:   0.15%  0.10%  0.03%  0.22%
##   149:   0.15%  0.10%  0.03%  0.22%
##   150:   0.14%  0.10%  0.03%  0.21%
##   151:   0.14%  0.10%  0.03%  0.21%
##   152:   0.14%  0.10%  0.03%  0.21%
##   153:   0.16%  0.15%  0.03%  0.22%
##   154:   0.16%  0.15%  0.03%  0.22%
##   155:   0.16%  0.15%  0.03%  0.22%
##   156:   0.16%  0.15%  0.03%  0.22%
##   157:   0.16%  0.15%  0.03%  0.22%
##   158:   0.16%  0.15%  0.03%  0.22%
##   159:   0.16%  0.15%  0.03%  0.22%
##   160:   0.16%  0.15%  0.03%  0.22%
##   161:   0.16%  0.15%  0.03%  0.22%
##   162:   0.16%  0.15%  0.03%  0.22%
##   163:   0.16%  0.15%  0.03%  0.22%
##   164:   0.16%  0.15%  0.03%  0.22%
##   165:   0.16%  0.15%  0.03%  0.22%
##   166:   0.16%  0.15%  0.03%  0.22%
##   167:   0.16%  0.15%  0.03%  0.22%
##   168:   0.16%  0.15%  0.03%  0.22%
##   169:   0.16%  0.15%  0.03%  0.22%
##   170:   0.16%  0.15%  0.03%  0.22%
##   171:   0.16%  0.15%  0.03%  0.22%
##   172:   0.16%  0.15%  0.03%  0.22%
##   173:   0.16%  0.15%  0.03%  0.22%
##   174:   0.16%  0.15%  0.03%  0.22%
##   175:   0.16%  0.15%  0.03%  0.22%
##   176:   0.16%  0.15%  0.03%  0.22%
##   177:   0.15%  0.10%  0.03%  0.22%
##   178:   0.15%  0.10%  0.03%  0.22%
##   179:   0.15%  0.10%  0.03%  0.22%
##   180:   0.14%  0.10%  0.03%  0.21%
##   181:   0.14%  0.10%  0.03%  0.21%
##   182:   0.14%  0.10%  0.03%  0.21%
##   183:   0.15%  0.10%  0.03%  0.22%
##   184:   0.15%  0.10%  0.03%  0.22%
##   185:   0.15%  0.10%  0.03%  0.22%
##   186:   0.15%  0.10%  0.03%  0.22%
##   187:   0.15%  0.10%  0.03%  0.22%
##   188:   0.15%  0.10%  0.03%  0.22%
##   189:   0.15%  0.10%  0.03%  0.22%
##   190:   0.15%  0.10%  0.03%  0.22%
##   191:   0.15%  0.10%  0.03%  0.22%
##   192:   0.15%  0.10%  0.03%  0.22%
##   193:   0.15%  0.10%  0.03%  0.22%
##   194:   0.15%  0.10%  0.03%  0.22%
##   195:   0.15%  0.10%  0.03%  0.22%
##   196:   0.15%  0.10%  0.03%  0.22%
##   197:   0.15%  0.10%  0.03%  0.22%
##   198:   0.15%  0.10%  0.03%  0.22%
##   199:   0.15%  0.10%  0.03%  0.22%
##   200:   0.15%  0.10%  0.03%  0.22%

Variable Importance

rf_imp <- varImp(rf.train)
plot(rf_imp, scales = list(y = list(cex = .95)))

# mtry: number of variables randomly sampled as candidates at each split
# ntree: number of trees to grow
# cutoff: cutoff probabilities in majority vote
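Instead of fixing mtry = 10 by hand, the randomForest package offers tuneRF, which searches over mtry using the out-of-bag error; a minimal sketch on the same train data:

```r
# Try doubling/halving mtry from the default and keep the value
# with the lowest out-of-bag error estimate
set.seed(1)
tuneRF(x = subset(train, select = -Pollution), y = train$Pollution,
       ntreeTry = 200, stepFactor = 2, improve = 0.01)
```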

Prediction

rf.pred <- predict(rf.train, newdata=test)
confusionMatrix(rf.pred, test$Pollution)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH  LOW MEDIUM
##     HIGH    494    0      1
##     LOW       0  898      1
##     MEDIUM    0    0   1817
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9994          
##                  95% CI : (0.9978, 0.9999)
##     No Information Rate : 0.5665          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9989          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: HIGH Class: LOW Class: MEDIUM
## Sensitivity               1.0000     1.0000        0.9989
## Specificity               0.9996     0.9996        1.0000
## Pos Pred Value            0.9980     0.9989        1.0000
## Neg Pred Value            1.0000     1.0000        0.9986
## Prevalence                0.1538     0.2797        0.5665
## Detection Rate            0.1538     0.2797        0.5659
## Detection Prevalence      0.1542     0.2800        0.5659
## Balanced Accuracy         0.9998     0.9998        0.9995
rfPred = predict(rf.train, newdata=test)
CM = confusionMatrix(factor(rfPred), test$Pollution)$table


threshold = 0.2
rfProb = predict(rf.train, newdata=test, type="prob")
rfPred = rep("HIGH", nrow(test))
rfPred[which(rfProb[,2] > threshold)] = "LOW"
rfPred[which(rfProb[,3] > threshold)] = "MEDIUM"
confusionMatrix(factor(rfPred), test$Pollution)$table
##           Reference
## Prediction HIGH  LOW MEDIUM
##     HIGH    491    0      0
##     LOW       0  891      1
##     MEDIUM    3    7   1818

Random forest grows many classification trees, each on a bootstrap sample of the training data and considering only a random subset of predictors at each split. Rather than searching for a single best tree, it aggregates these de-correlated trees by majority vote, and this diversity makes the resulting prediction much more stable.

Furthermore, this algorithm stands out for being powerful and highly accurate, and it handles many predictors at once with no need for normalization. It also reduces the risk of overfitting, provides flexibility (it can handle both regression and classification tasks with a high degree of accuracy) and makes it easy to evaluate variable importance, i.e. each variable's contribution to the model.
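A convenient by-product of bagging is the out-of-bag (OOB) error shown per tree in the trace above; plotting the fitted forest gives the same information graphically and confirms that 200 trees were enough for the error to stabilise:

```r
# OOB error and per-class error rates versus the number of trees
plot(rf.train)
legend("topright", legend = colnames(rf.train$err.rate), col = 1:4, lty = 1:4)
```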

The random forest accuracy (99.9%) is again very high, which confirms the benefits mentioned above: it is higher than the decision tree's (although not by much, as both accuracies are already very high due to the way the Pollution variable was created). The forest grew 200 trees, with 10 variables tried at each split. Kappa (0.9989) is also excellent.

On the whole, between decision trees and random forests, it goes without saying that random forest is the better choice when looking for the best predictions. Even though it takes more time to train (a drawback for higher-dimensional data), it returns better predictions on the evaluated data.

Advanced Regression

Preparing data

In this part, we will be predicting the price of a car.

# DATA PARTITION
data2 = data # copy where all categorical variables will be converted to numeric
# first character to factor, then factor to numeric codes
cat_vars <- c("Manufacturer", "Model", "Category", "Leather.interior",
              "Fuel.type", "Gear.box.type", "Drive.wheels", "Doors",
              "Wheel", "Color", "Pollution")
data2[cat_vars] <- lapply(data2[cat_vars], function(x) as.numeric(as.factor(x)))
in_train <- createDataPartition(data2$Price, p = 0.75, list = FALSE)  # 75% for training
training <- data2[in_train,]
testing <- data2[-in_train,]
nrow(training)
## [1] 12050
nrow(testing)
## [1] 4014

Linear Approaches

corr=cor(training)

First we try a simple regression.

linFit <- lm(Price ~., data=training)
summary(linFit)
## 
## Call:
## lm(formula = Price ~ ., data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34981  -6545   -583   5694  37060 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.511e+06  5.321e+04 -47.187  < 2e-16 ***
## Levy             -5.247e+00  2.589e-01 -20.267  < 2e-16 ***
## Manufacturer      2.501e+01  5.958e+00   4.197 2.73e-05 ***
## Model             7.651e-01  2.916e-01   2.624  0.00871 ** 
## Prod.year         1.249e+03  2.642e+01  47.268  < 2e-16 ***
## Category         -3.308e+02  3.413e+01  -9.691  < 2e-16 ***
## Leather.interior -6.379e+02  2.360e+02  -2.703  0.00689 ** 
## Fuel.type        -1.973e+03  7.740e+01 -25.487  < 2e-16 ***
## Engine.volume     3.061e+03  2.359e+02  12.976  < 2e-16 ***
## Mileage          -8.380e-07  2.063e-06  -0.406  0.68460    
## Cylinders         4.226e+02  1.633e+02   2.588  0.00965 ** 
## Gear.box.type     2.605e+03  1.082e+02  24.091  < 2e-16 ***
## Drive.wheels      1.372e+03  1.884e+02   7.283 3.47e-13 ***
## Doors             1.137e+03  4.104e+02   2.771  0.00560 ** 
## Wheel            -2.450e+03  3.649e+02  -6.715 1.96e-11 ***
## Color             3.420e+01  1.651e+01   2.072  0.03831 *  
## Airbags          -4.660e+02  2.411e+01 -19.328  < 2e-16 ***
## New              -4.403e+03  5.003e+02  -8.801  < 2e-16 ***
## Turbo             2.942e+03  3.212e+02   9.158  < 2e-16 ***
## Fuel.idx          1.033e+04  3.958e+02  26.098  < 2e-16 ***
## Pollution         2.073e+03  1.864e+02  11.125  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9502 on 12029 degrees of freedom
## Multiple R-squared:  0.3108, Adjusted R-squared:  0.3097 
## F-statistic: 271.3 on 20 and 12029 DF,  p-value: < 2.2e-16

We can see that the multiple R-squared is about 0.31. This means that the proportion of variation in the dependent variable that can be explained by the independent variables is low. We will try different models later to increase it, in order to better replicate the observed outcomes, but we must keep in mind that the starting value is low.

It is also worth considering that the R-squared may be low because relevant predictors are missing. Adding more (or better engineered) variables may improve this value and the performance of the model in general, as we will see later.
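As a reminder of what this metric measures, the multiple R-squared can be reproduced by hand from the fitted model; a sketch using the linFit object above:

# R-squared = 1 - SSE/SST: the share of the variance of Price
# captured by the fitted model
sse <- sum(residuals(linFit)^2)
sst <- sum((training$Price - mean(training$Price))^2)
1 - sse / sst  # should match the Multiple R-squared from summary(linFit)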

par(mfrow=c(2,2))
plot(linFit, pch=21 ,bg='blue1',cex=1)

These diagnostic plots assess how well the fitted model satisfies the assumptions of linear regression, which helps us judge how much to trust its predictions.

The residuals vs fitted plot is used to detect non-linearity, unequal error variances, and outliers. There is a slight curve in this graph; however, given the large number of data points and the fact that the red line stays close to the dashed line, linearity seems to hold reasonably well. Moving right along the x-axis, the spread of the residuals appears to increase, and there are many outliers (points with large residuals).

The normal Q-Q plot (theoretical quantiles) suggests that the residuals are approximately normally distributed, as the points fall roughly on the 45-degree line.

The scale-location plot shows the square root of the standardized residuals against the fitted values; a roughly horizontal line with evenly spread points indicates constant error variance (homoscedasticity).

The residuals vs leverage plot allows us to identify influential observations. Very few observations have high leverage, so only a handful have a strong influence on the coefficients of the regression model; removing those would noticeably change the fit. Points falling outside Cook’s distance (the red dashed lines) are considered influential observations, and from the graph we can see that none of the points do.
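The visual check of Cook’s distance can be complemented numerically; a minimal sketch using the common 4/n rule of thumb:

# Cook's distance for every training observation
cd <- cooks.distance(linFit)
head(sort(cd, decreasing = TRUE))   # the most influential points
which(cd > 4 / nrow(training))      # indices exceeding the 4/n rule of thumb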

predictions=predict(linFit,newdata = testing)
cor(predictions, testing$Price)^2  # squared correlation between predicted and observed prices
par(mfrow=c(1,1)) # to set plots to normal back again

Right now, this model is not very good (very low correlation).

Multiple Regression

We will now try to see if with multiple regression we can obtain better results.

linFit <- lm(Price ~ Fuel.idx + Mileage + Levy + Engine.volume + Manufacturer +
               Category + Cylinders + Fuel.type +
               Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New + 
               Turbo, data=training)
summary(linFit)
## 
## Call:
## lm(formula = Price ~ Fuel.idx + Mileage + Levy + Engine.volume + 
##     Manufacturer + Category + Cylinders + Fuel.type + Gear.box.type + 
##     Drive.wheels + Wheel + Color + Airbags + New + Turbo, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27494  -8094  -1321   5696  38569 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.399e+04  9.667e+02  24.812  < 2e-16 ***
## Fuel.idx       1.069e+04  3.811e+02  28.055  < 2e-16 ***
## Mileage       -2.625e-06  2.261e-06  -1.161 0.245682    
## Levy          -9.422e-01  2.662e-01  -3.539 0.000403 ***
## Engine.volume  4.439e+02  2.441e+02   1.819 0.068983 .  
## Manufacturer   7.997e+00  6.444e+00   1.241 0.214592    
## Category      -1.862e+02  3.615e+01  -5.150 2.64e-07 ***
## Cylinders     -1.290e+03  1.511e+02  -8.534  < 2e-16 ***
## Fuel.type     -1.569e+03  8.348e+01 -18.798  < 2e-16 ***
## Gear.box.type  2.073e+03  1.155e+02  17.945  < 2e-16 ***
## Drive.wheels   5.219e+02  2.006e+02   2.602 0.009271 ** 
## Wheel         -7.564e+03  3.715e+02 -20.359  < 2e-16 ***
## Color          2.997e+01  1.807e+01   1.659 0.097235 .  
## Airbags       -2.644e+02  2.586e+01 -10.225  < 2e-16 ***
## New           -5.715e+03  5.475e+02 -10.439  < 2e-16 ***
## Turbo          2.049e+03  3.497e+02   5.860 4.74e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10420 on 12034 degrees of freedom
## Multiple R-squared:  0.1707, Adjusted R-squared:  0.1697 
## F-statistic: 165.2 on 15 and 12034 DF,  p-value: < 2.2e-16

R2 is actually lower than before, since we removed some informative predictors (notably Prod.year). These results tell us that the chosen independent variables explain little of the variation in the dependent variable; the model we have at the moment does not capture much of the variance of Price in this sample.

Model Selection

exhaustive <- regsubsets(Price ~ Fuel.idx + Mileage + Prod.year + Levy + Engine.volume + Manufacturer
                         + Model + Category + Cylinders + Leather.interior + Fuel.type +
                           Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New + 
                           Turbo, data=training)

model = Price ~ Fuel.idx + Mileage + Prod.year + Levy + Engine.volume + Manufacturer+
  Model + Category + Cylinders + Leather.interior + Fuel.type +
  Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New + 
  Turbo

linFit <- lm(model, data=training)
# ols_step_all_possible(linFit) # All possible subset regressions: the number is exponential with p 

# ols_step_best_subset(linFit) # The best subset regression for each p: still exponential with p 

ols_step_forward_p(linFit) # forward based on p-value
## 
##                                     Selection Summary                                      
## ------------------------------------------------------------------------------------------
##         Variable                          Adj.                                                
## Step        Entered         R-Square    R-Square      C(p)           AIC           RMSE       
## ------------------------------------------------------------------------------------------
##    1    Prod.year             0.0701      0.0700    4012.2757    258530.7151    11029.3046    
##    2    Fuel.idx              0.1286      0.1284    3003.8949    257749.6311    10677.1311    
##    3    Fuel.type             0.1770      0.1768    2170.2885    257063.2009    10376.8785    
##    4    Gear.box.type         0.2331      0.2328    1203.5800    256214.5424    10017.4101    
##    5    Levy                  0.2635      0.2632     680.7769    255729.3186     9817.3315    
##    6    Airbags               0.2781      0.2778     429.5998    255489.0345     9719.5331    
##    7    Wheel                 0.2824      0.2820     356.9923    255418.6977     9690.8057    
##    8    New                   0.2865      0.2860     288.9869    255352.4258     9663.7931    
##    9    Turbo                 0.2902      0.2896     227.8206    255292.4899     9639.3898    
##   10    Category              0.2937      0.2931     169.2541    255234.8027     9615.9453    
##   11    Engine.volume         0.2975      0.2968     105.2002    255171.3689     9590.2708    
##   12    Drive.wheels          0.2994      0.2987      73.5029    255139.8505     9577.3397    
##   13    Manufacturer          0.3013      0.3005      43.8226    255110.2533     9565.1885    
##   14    Cylinders             0.3022      0.3014      29.3248    255095.7659     9559.0441    
##   15    Leather.interior      0.3026      0.3018      24.1827    255090.6212     9556.6077    
##   16    Model                 0.3030      0.3021      20.2028    255086.6359     9554.6316    
##   17    Color                 0.3033      0.3023      17.0328    255083.4589     9552.9763    
## ------------------------------------------------------------------------------------------
plot(ols_step_forward_p(linFit))

This function builds a regression model from a set of candidate predictors by entering them one at a time based on p-values. In the plot we can see R-squared grow from about 0.07 to 0.30 (still very low for a model); the adjusted R-squared behaves similarly, as it merely adjusts for the number of terms in the model. C(p) (Mallows’s Cp) assesses the fit of the estimated regression model, while AIC and SBC are information criteria computed for each candidate model.
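The AIC column of the table can be reproduced for any fitted model; a sketch with the full model above:

# AIC = -2*logLik + 2*k, where k counts the regression coefficients
# plus the error variance estimated by lm
AIC(linFit)
-2 * as.numeric(logLik(linFit)) + 2 * (length(coef(linFit)) + 1)  # same value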

ols_step_forward_aic(linFit) # forward based on AIC
## 
##                                     Selection Summary                                      
## ------------------------------------------------------------------------------------------
## Variable               AIC             Sum Sq             RSS          R-Sq      Adj. R-Sq 
## ------------------------------------------------------------------------------------------
## Prod.year           258530.715         1.10433e+11    1.465586e+12    0.07007      0.06999 
## Fuel.idx            257749.631    202647123410.469    1.373372e+12    0.12858      0.12844 
## Fuel.type           257063.201    278910158867.316    1.297109e+12    0.17697      0.17677 
## Gear.box.type       256214.542         3.67321e+11    1.208698e+12    0.23307      0.23281 
## Levy                255729.319         4.15218e+11    1.160801e+12    0.26346      0.26315 
## Airbags             255489.035    438324655978.113    1.137694e+12    0.27812      0.27776 
## Wheel               255418.698    445133838559.480    1.130885e+12    0.28244      0.28202 
## New                 255352.426         4.51523e+11    1.124496e+12    0.28650      0.28602 
## Turbo               255292.490         4.57288e+11    1.118731e+12    0.29015      0.28962 
## Category            255234.803    462815677013.161    1.113203e+12    0.29366      0.29307 
## Engine.volume       255171.369    468844199439.694    1.107175e+12    0.29749      0.29684 
## Drive.wheels        255139.851    471919642250.367    1.104099e+12    0.29944      0.29874 
## Manufacturer        255110.253         4.74811e+11    1.101208e+12    0.30127      0.30052 
## Cylinders           255095.766    476316693602.173    1.099702e+12    0.30223      0.30142 
## Leather.interior    255090.621    476968532587.510     1.09905e+12    0.30264      0.30177 
## Model               255086.636    477514302487.423    1.098504e+12    0.30299      0.30206 
## Color               255083.459    477986148124.548    1.098033e+12    0.30329      0.30230 
## ------------------------------------------------------------------------------------------
plot(ols_step_forward_aic(linFit))

In the plot we see the forward OLS steps based on AIC, where production year is the variable most strongly related with the price. This makes sense, since older cars generally sell for lower prices. The fuel information of a car and the gear box type also seem important for predicting the price.

ols_step_backward_aic(linFit) # backward AIC
## 
## 
##                              Backward Elimination Summary                             
## ------------------------------------------------------------------------------------
## Variable         AIC            RSS              Sum Sq          R-Sq      Adj. R-Sq 
## ------------------------------------------------------------------------------------
## Full Model    255085.426     1.09803e+12    477989141610.556    0.30329      0.30225 
## Mileage       255083.459    1.098033e+12    477986148124.555    0.30329      0.30230 
## ------------------------------------------------------------------------------------
plot(ols_step_backward_aic(linFit))

With this statistical method, we look for the simplest model that explains the data by removing predictors instead of adding them. Starting from the full model, backward elimination drops Mileage, indicating that, among the candidates, it is the variable that contributes least to explaining the price of the cars.

ols_step_both_aic(linFit) # stepwise AIC
## 
## 
##                                             Stepwise Summary                                            
## ------------------------------------------------------------------------------------------------------
## Variable             Method        AIC            RSS              Sum Sq          R-Sq      Adj. R-Sq 
## ------------------------------------------------------------------------------------------------------
## Prod.year           addition    258530.715    1.465586e+12         1.10433e+11    0.07007      0.06999 
## Fuel.idx            addition    257749.631    1.373372e+12    202647123410.469    0.12858      0.12844 
## Fuel.type           addition    257063.201    1.297109e+12    278910158867.316    0.17697      0.17677 
## Gear.box.type       addition    256214.542    1.208698e+12         3.67321e+11    0.23307      0.23281 
## Levy                addition    255729.319    1.160801e+12         4.15218e+11    0.26346      0.26315 
## Airbags             addition    255489.035    1.137694e+12    438324655978.113    0.27812      0.27776 
## Wheel               addition    255418.698    1.130885e+12    445133838559.480    0.28244      0.28202 
## New                 addition    255352.426    1.124496e+12         4.51523e+11    0.28650      0.28602 
## Turbo               addition    255292.490    1.118731e+12         4.57288e+11    0.29015      0.28962 
## Category            addition    255234.803    1.113203e+12    462815677013.161    0.29366      0.29307 
## Engine.volume       addition    255171.369    1.107175e+12    468844199439.694    0.29749      0.29684 
## Drive.wheels        addition    255139.851    1.104099e+12    471919642250.367    0.29944      0.29874 
## Manufacturer        addition    255110.253    1.101208e+12         4.74811e+11    0.30127      0.30052 
## Cylinders           addition    255095.766    1.099702e+12    476316693602.173    0.30223      0.30142 
## Leather.interior    addition    255090.621     1.09905e+12    476968532587.510    0.30264      0.30177 
## Model               addition    255086.636    1.098504e+12    477514302487.423    0.30299      0.30206 
## Color               addition    255083.459    1.098033e+12    477986148124.548    0.30329      0.30230 
## ------------------------------------------------------------------------------------------------------
plot(ols_step_both_aic(linFit))

With the stepwise AIC method, we get results similar to those of forward AIC.

linFit <- lm(Price ~ Fuel.idx + Mileage + Prod.year + Levy + Engine.volume + Manufacturer+
               Model + Category + Cylinders + Leather.interior + Fuel.type +
               Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New + 
               Turbo, data=training)
summary(linFit)
## 
## Call:
## lm(formula = Price ~ Fuel.idx + Mileage + Prod.year + Levy + 
##     Engine.volume + Manufacturer + Model + Category + Cylinders + 
##     Leather.interior + Fuel.type + Gear.box.type + Drive.wheels + 
##     Wheel + Color + Airbags + New + Turbo, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36378  -6554   -489   5642  36731 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.501e+06  5.338e+04 -46.849  < 2e-16 ***
## Fuel.idx          1.236e+04  3.513e+02  35.177  < 2e-16 ***
## Mileage          -3.755e-07  2.074e-06  -0.181 0.856288    
## Prod.year         1.250e+03  2.647e+01  47.223  < 2e-16 ***
## Levy             -5.173e+00  2.600e-01 -19.895  < 2e-16 ***
## Engine.volume     2.376e+03  2.294e+02  10.357  < 2e-16 ***
## Manufacturer      2.778e+01  5.984e+00   4.642 3.48e-06 ***
## Model             6.849e-01  2.927e-01   2.340 0.019315 *  
## Category         -3.136e+02  3.353e+01  -9.352  < 2e-16 ***
## Cylinders        -5.419e+02  1.395e+02  -3.885 0.000103 ***
## Leather.interior -5.854e+02  2.371e+02  -2.469 0.013560 *  
## Fuel.type        -1.979e+03  7.774e+01 -25.460  < 2e-16 ***
## Gear.box.type     2.571e+03  1.087e+02  23.658  < 2e-16 ***
## Drive.wheels      1.142e+03  1.869e+02   6.112 1.01e-09 ***
## Wheel            -2.507e+03  3.657e+02  -6.856 7.42e-12 ***
## Color             3.772e+01  1.659e+01   2.273 0.023022 *  
## Airbags          -4.489e+02  2.419e+01 -18.556  < 2e-16 ***
## New              -4.407e+03  5.030e+02  -8.762  < 2e-16 ***
## Turbo             2.828e+03  3.220e+02   8.783  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9553 on 12031 degrees of freedom
## Multiple R-squared:  0.3033, Adjusted R-squared:  0.3022 
## F-statistic:   291 on 18 and 12031 DF,  p-value: < 2.2e-16

Not a reasonable model: even with all predictors included, the R-squared stays around 0.30.

Linear Regression

In this case, we will try to consider a new model to see if it’s better.

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, repeats = 1)

ModelS = Price ~ Prod.year + Mileage + Cylinders + Engine.volume + Gear.box.type + 
  Fuel.type + Category + Manufacturer

Train

lm_tune <- train(ModelS, data = training, 
                 method = "lm", 
                 preProc=c('scale', 'center'),
                 trControl = ctrl)
lm_tune
## Linear Regression 
## 
## 12050 samples
##     8 predictor
## 
## Pre-processing: scaled (8), centered (8) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 9639, 9639, 9640, 9641, 9641 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   10540.89  0.1506547  8431.032
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Predict

test_results <- data.frame(price = testing$Price)

test_results$lm <- predict(lm_tune, testing)
postResample(pred = test_results$lm,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 1.059690e+04 1.425268e-01 8.457320e+03

Visualization

qplot(test_results$lm, test_results$price) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
  lims(x = c(0, 50500), y = c(0, 50500)) +
  geom_abline(intercept = 0, slope = 1, colour = "cyan") +
  theme_bw()

After applying this linear regression model, we reach essentially the same conclusion: the correlation between predictions and observed prices remains really low.

Overfitted Linear Regression

ModelFF = Price ~ Fuel.idx + Mileage + Prod.year + Levy + Engine.volume + Manufacturer+
  Model + Category + Cylinders + Leather.interior + Fuel.type +
  Gear.box.type + Drive.wheels + Wheel + Color + Airbags + New + 
  Turbo

Train

alm_tune <- train(ModelFF, data = training, 
                  method = "lm", 
                  preProc=c('scale', 'center'),
                  trControl = ctrl)

Predict

test_results$alm <- predict(alm_tune, testing)
postResample(pred = test_results$alm,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 9764.7075204    0.2726752 7617.9727971
# even though it is still very low, we can see how the rsquared value has increased

Visualization

qplot(test_results$alm, test_results$price) + 
  labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
  lims(x = c(0, 50500), y = c(0, 50500)) +
  geom_abline(intercept = 0, slope = 1, colour = "cyan") +
  theme_bw()

With the overfitted linear regression model we see that it performs better than the previous one, as R2 has increased by about 0.1 (still very low). However, a model with this many predictors risks not generalizing well to new data: it can perform well on the training data yet poorly on unseen data.
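One way to make this concern concrete is to compare the error on the training and the testing sets; a sketch reusing the objects fitted above:

# An overfitted model looks noticeably better on the data it was trained on
postResample(pred = predict(alm_tune, training), obs = training$Price)  # training error
postResample(pred = predict(alm_tune, testing),  obs = testing$Price)   # test error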

Forward Regression

Train

for_tune <- train(ModelFF, data = training, 
                  method = "leapForward", 
                  preProc=c('scale', 'center'),
                  tuneGrid = expand.grid(nvmax = 4:10),
                  trControl = ctrl)

for_tune
## Linear Regression with Forward Selection 
## 
## 12050 samples
##    18 predictor
## 
## Pre-processing: scaled (18), centered (18) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 9640, 9640, 9640, 9639, 9641 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE       Rsquared   MAE     
##    4     10019.601  0.2325435  7922.638
##    5      9820.318  0.2630073  7690.351
##    6      9723.115  0.2774605  7639.617
##    7      9708.525  0.2797322  7632.513
##    8      9675.124  0.2846655  7605.876
##    9      9649.654  0.2884678  7593.986
##   10      9622.573  0.2923922  7579.291
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 10.
plot(for_tune)

Predict

# variables that are selected
coef(for_tune$finalModel, for_tune$bestTune$nvmax)
##   (Intercept)      Fuel.idx     Prod.year          Levy      Category 
##    14586.0495     3550.2349     4875.3344    -1891.5734     -697.1272 
##     Fuel.type Gear.box.type         Wheel       Airbags           New 
##    -2867.0605     2455.9784     -819.6320    -1573.2975     -750.6098 
##         Turbo 
##      770.5597
test_results$frw <- predict(for_tune, testing)
postResample(pred = test_results$frw,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 9831.3195473    0.2627968 7669.0415601

Visualization

qplot(test_results$frw, test_results$price) + 
  labs(title="Forward Regression Observed VS Predicted", x="Predicted", y="Observed") +
  lims(x = c(0, 50000), y = c(0, 50000)) +
  geom_abline(intercept = 0, slope = 1, colour = "cyan") +
  theme_bw()

We can see that the higher the number of predictors, the better the accuracy, although the differences are small. Still nothing new for this family of models…

Backward Regression

Train

back_tune <- train(ModelFF, data = training, 
                   method = "leapBackward", 
                   preProc=c('scale', 'center'),
                   tuneGrid = expand.grid(nvmax = 4:10),
                   trControl = ctrl)
back_tune
## Linear Regression with Backwards Selection 
## 
## 12050 samples
##    18 predictor
## 
## Pre-processing: scaled (18), centered (18) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 9641, 9639, 9639, 9640, 9641 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE       Rsquared   MAE     
##    4     10019.599  0.2329168  7922.251
##    5      9820.234  0.2629616  7689.203
##    6      9723.220  0.2773933  7638.404
##    7      9705.508  0.2800372  7631.584
##    8      9677.982  0.2840444  7611.935
##    9      9644.664  0.2889592  7584.385
##   10      9613.055  0.2935802  7562.309
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 10.
plot(back_tune)

Predict

# variables that are selected
coef(back_tune$finalModel, back_tune$bestTune$nvmax)
##   (Intercept)      Fuel.idx     Prod.year          Levy Engine.volume 
##    14586.0495     3475.8132     5361.5362    -2194.3142      997.7935 
##      Category     Fuel.type Gear.box.type       Airbags           New 
##     -732.2579    -2785.7817     2367.0933    -1799.4518     -783.1655 
##         Turbo 
##      900.6762
test_results$bw <- predict(back_tune, testing)
postResample(pred = test_results$bw,  obs = test_results$price)
##        RMSE    Rsquared         MAE 
## 9820.602262    0.264235 7656.466569

Visualization

qplot(test_results$bw, test_results$price) + 
  labs(title="Backward Regression Observed VS Predicted", x="Predicted", y="Observed") +
  lims(x = c(0, 50000), y = c(0, 50000)) +
  geom_abline(intercept = 0, slope = 1, colour = "cyan") +
  theme_bw()

We obtain conclusions very similar to those from forward regression.

Stepwise Regression

step_tune <- train(ModelFF, data = training, 
                   method = "leapSeq", 
                   preProc=c('scale', 'center'),
                   tuneGrid = expand.grid(nvmax = 4:10),
                   trControl = ctrl)
plot(step_tune)

# which variables are selected?
coef(step_tune$finalModel, step_tune$bestTune$nvmax)
##   (Intercept)      Fuel.idx     Prod.year          Levy Engine.volume 
##    14586.0495     3475.8132     5361.5362    -2194.3142      997.7935 
##      Category     Fuel.type Gear.box.type       Airbags           New 
##     -732.2579    -2785.7817     2367.0933    -1799.4518     -783.1655 
##         Turbo 
##      900.6762
test_results$seq <- predict(step_tune, testing)
postResample(pred = test_results$seq,  obs = test_results$price)
##        RMSE    Rsquared         MAE 
## 9820.602262    0.264235 7656.466569

No improvement is reached with these statistical tools either. Nevertheless, we see that the RMSE is minimized with 10 predictors. These are still not the RMSE and R2 values we are looking for on the testing set.

Ridge Regression

# X matrix
X = model.matrix(ModelFF, data=training)[,-1]  # skip column of ones

# y variable
y = training$Price
grid = seq(0, .1, length = 100)  # a 100-size grid for lambda (rho in slides)
ridge.mod = glmnet(X, y, alpha=0, lambda=grid)  # alpha=0 for ridge regression

dim(coef(ridge.mod))
## [1]  19 100
ridge.cv = cv.glmnet(X, y, type.measure="mse", alpha=0)
plot(ridge.cv)

opt.lambda <- ridge.cv$lambda.min
opt.lambda
## [1] 302.7303
lambda.index <- which(ridge.cv$lambda == ridge.cv$lambda.1se)
beta.ridge <- ridge.cv$glmnet.fit$beta[, lambda.index]  # coefficients at the more conservative lambda.1se
beta.ridge
##         Fuel.idx          Mileage        Prod.year             Levy 
##     9.916733e+03    -5.152177e-07     9.580160e+02    -3.486091e+00 
##    Engine.volume     Manufacturer            Model         Category 
##     1.369622e+03     1.513909e+01     8.200371e-01    -2.580414e+02 
##        Cylinders Leather.interior        Fuel.type    Gear.box.type 
##    -4.257498e+02    -1.863170e+02    -1.538917e+03     2.136339e+03 
##     Drive.wheels            Wheel            Color          Airbags 
##     9.842520e+02    -2.932697e+03     3.132923e+01    -3.623346e+02 
##              New            Turbo 
##    -4.158524e+03     3.042682e+03
X.test = model.matrix(ModelFF, data=testing)[,-1]  # skip column of ones

ridge.pred = predict(ridge.cv$glmnet.fit, s=opt.lambda, newx=X.test)

y.test = testing$Price  # predictions are on the price scale, so compare against Price (not its log)

postResample(pred = ridge.pred,  obs = y.test)

In this case we have used another tool to improve the model by shrinking the estimated betas. However, we reach similar conclusions when evaluating, so this will not be our final choice. We used lambda.min, the penalty value that minimizes the cross-validated MSE.
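The effect of the penalty can also be visualized with the coefficient paths; a sketch based on the objects fitted above:

# Each curve is one coefficient; ridge shrinks all of them smoothly
# towards zero as lambda grows, without setting any of them exactly to zero
plot(ridge.cv$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(opt.lambda), lty = 2)  # the cross-validated choice of lambda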

With Caret

# the grid for lambda
ridge_grid <- expand.grid(lambda = seq(0, .1, length = 100))

Train

ridge_tune <- train(ModelFF, data = training,
                    method='ridge',
                    preProc=c('scale','center'),
                    tuneGrid = ridge_grid,
                    trControl=ctrl)
plot(ridge_tune)

Best Tune

ridge_tune$bestTune
##        lambda
## 4 0.003030303

Prediction

test_results$ridge <- predict(ridge_tune, testing)

postResample(pred = test_results$ridge,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 9764.3623637    0.2726683 7617.7512353

Similar results, with just a small increase in the R2. Again, it is very low.

Lasso

lasso_grid <- expand.grid(fraction = seq(.01, 1, length = 100))

Train

lasso_tune <- train(ModelFF, data = training,
                    method='lasso',
                    preProc=c('scale','center'),
                    tuneGrid = lasso_grid,
                    trControl=ctrl)
plot(lasso_tune)

Best Tune

lasso_tune$bestTune
##     fraction
## 100        1

Prediction

test_results$lasso <- predict(lasso_tune, testing)
postResample(pred = test_results$lasso,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 9764.7075204    0.2726752 7617.9727971

With the Lasso regularization technique, the selected fraction is 1, meaning no shrinkage is applied: the fit coincides with the unpenalized least-squares model, and the R2 stays around 0.27 (again, too low for a model).
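Since the chosen fraction is 1, the L1 constraint is inactive and the lasso fit should coincide with the unpenalized linear model; a quick sanity check (sketch):

# With fraction = 1 the lasso predictions should essentially reproduce
# those of the full ordinary least-squares model fitted earlier
all.equal(test_results$lasso, test_results$alm, tolerance = 1e-6)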

Elastic Net

modelLookup('glmnet')
##    model parameter                    label forReg forClass probModel
## 1 glmnet     alpha        Mixing Percentage   TRUE     TRUE      TRUE
## 2 glmnet    lambda Regularization Parameter   TRUE     TRUE      TRUE
elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))

Train

glmnet_tune <- train(ModelFF, data = training,
                     method='glmnet',
                     preProc=c('scale','center'),
                     tuneGrid = elastic_grid,
                     trControl=ctrl)
plot(glmnet_tune)

Best Tune

glmnet_tune$bestTune
##     alpha lambda
## 231   0.2    0.1

Prediction

test_results$glmnet <- predict(glmnet_tune, testing)

postResample(pred = test_results$glmnet,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 9763.9481589    0.2726324 7616.7303021

As we can see, none of the regularization techniques used above has improved our linear model approach: the evaluation metrics have remained essentially as they were at the beginning (the r-squared moved only in the fourth decimal place). Therefore, we are going to try different models in order to achieve a better one.

MACHINE LEARNING TOOLS

KNN

modelLookup('kknn')
##   model parameter           label forReg forClass probModel
## 1  kknn      kmax Max. #Neighbors   TRUE     TRUE      TRUE
## 2  kknn  distance        Distance   TRUE     TRUE      TRUE
## 3  kknn    kernel          Kernel   TRUE     TRUE      TRUE
# 3 hyper-parameters: kmax, distance, kernel
# kmax: number of neighbors considered
# distance: parameter of Minkowski distance (p in Lp)
# kernel: "rectangular" (standard unweighted knn), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian", "rank" and "optimal".
knn_tune <- train(ModelS, 
                  data = training,
                  method = "kknn",   
                  preProc=c('scale','center'),
                  tuneGrid = data.frame(kmax=c(2,5,8,12),distance=2,kernel='optimal'),
                  trControl = ctrl)
plot(knn_tune)

test_results$knn <- predict(knn_tune, testing)

postResample(pred = test_results$knn,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 7255.9567715    0.6017215 4627.6160497

Computationally, this has been a very expensive process, but it has clearly paid off: the RMSE has dropped and the r-squared has more than doubled, from about 0.27 to 0.60. This r-squared is still not good enough, but we should keep in mind that these metrics tend to be lower in regression than in classification, since the number of possible outcomes is much higher.
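Before moving on, it is worth checking how sensitive KNN is to the number of neighbours. The resampled metrics for every candidate `kmax` are stored in the caret train object (a sketch, assuming `knn_tune` is in memory):

```r
# Cross-validated performance for each candidate kmax
# (sketch; assumes knn_tune from the chunk above is in memory)
knn_tune$results[, c("kmax", "RMSE", "Rsquared", "MAE")]
```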

Random Forest

rf_tune <- train(ModelS, 
                 data = training,
                 method = "rf",
                 preProc=c('scale','center'),
                 trControl = ctrl,
                 ntree = 100,
                 tuneGrid = data.frame(mtry=c(1,3,5,7)),
                 importance = TRUE)

plot(rf_tune)

test_results$rf <- predict(rf_tune, testing)
postResample(pred = test_results$rf,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 6648.5219271    0.6641209 4437.5440699
# variable importance
plot(varImp(rf_tune, scale = F), scales = list(y = list(cex = .95)))

Again, it has been a really expensive process, but even more effective. Random forest ranks the most correlated variable, Prod.year, as the most important, while still giving weight to the others. The RMSE is smaller and the r-squared has increased significantly for our data base: we now reach almost 0.70, when we started at 0.10–0.30!

The linear approaches have been outperformed, as has KNN. So far, this is the best model.

Gradient Boosting

xgb_tune <- train(ModelS, 
                  data = training,
                  method = "xgbTree",
                  preProc=c('scale','center'),
                  objective="reg:squarederror",
                  trControl = ctrl,
                  tuneGrid = expand.grid(nrounds = c(10,50), max_depth = c(5,6,7), eta = c(0.01, 0.1, 1),
                                         gamma = c(1, 2, 3), colsample_bytree = c(0.5, 1), # must be a fraction in (0, 1]
                                         min_child_weight = c(1), subsample = c(0.2,0.5,0.8)))

test_results$xgb <- predict(xgb_tune, testing)
postResample(pred = test_results$xgb,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 7297.7494915    0.5964994 5257.4270377

As happened previously, this process has been computationally very expensive, and it has not managed to beat random forest. This will certainly not be our model: although it is somewhat better than the linear approaches, it falls clearly short of the other machine learning tools.
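Part of the cost comes from the size of the tuning grid above: 2 × 3 × 3 × 3 × 2 × 1 × 3 = 324 hyper-parameter combinations, each refitted on every resample. We can at least see which combination cross-validation selected (assuming `xgb_tune` is in memory):

```r
# Hyper-parameter combination selected by cross-validation
# (sketch; assumes xgb_tune from the chunk above is in memory)
xgb_tune$bestTune
```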

Ensemble

apply(test_results[-1], 2, function(x) mean(abs(x - test_results$price)))
##       lm      alm      frw       bw      seq    ridge    lasso   glmnet 
## 8457.320 7617.973 7669.042 7656.467 7656.467 7617.751 7617.973 7616.730 
##      knn       rf      xgb 
## 4627.616 4437.544 5257.427
# Combination
test_results$comb = (test_results$alm + test_results$knn + test_results$rf)/3

postResample(pred = test_results$comb,  obs = test_results$price)
##         RMSE     Rsquared          MAE 
## 7116.5643675    0.6325901 5122.6511235

Comparing it with the other models, and taking into account that our variables are not strongly correlated with the price, this is not a bad model. However, in this particular case the combination does not improve on the performance of random forest (the best model for our database), although it comes very close. Right now the only two candidates are random forest and KNN, which show the highest (and quite similar) efficiency.
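Instead of a plain average, the three models could be weighted by how well each one predicts. A minimal sketch using hypothetical inverse-MAE weights, with the MAE values taken from the comparison table above:

```r
# Inverse-MAE weighting sketch: models with lower test MAE get larger weights.
# MAE values taken from the comparison table above.
mae <- c(alm = 7617.97, knn = 4627.62, rf = 4437.54)
w <- (1 / mae) / sum(1 / mae)   # normalise so the weights sum to 1
round(w, 3)

# weighted combination (assumes the test_results columns used above exist):
# test_results$comb_w <- w["alm"]*test_results$alm +
#                        w["knn"]*test_results$knn +
#                        w["rf"]*test_results$rf
```

Random forest, having the smallest MAE, receives the largest weight, so the weighted combination stays closer to the best single model than the plain average does.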

Final predictions

yhat = test_results$comb

head(yhat) # show the prediction for 6 car prices
## [1] 10630.1362  3897.2514 12599.4401    40.4918 11523.9037 13522.6770
par(mfrow=c(1,1))

hist(yhat, col="lightblue")

Prediction intervals

First, let's look at the distribution of the prediction errors, which is fairly symmetric:

y = test_results$price
error = y-yhat
hist(error, col="mediumpurple3")

Machine Learning tools don’t provide us with prediction intervals, but we can split the testing set in two: one part to measure the size of the noise and another to compute the intervals from that estimate.

Let’s consider the first 1000 cars in testing to compute the noise size.

For the prediction intervals we will fix a 90% confidence.

noise = error[1:1000]

# intervals are computed for the remaining cars (1001 onwards),
# so the noise sample and the evaluated cars do not overlap
lwr = yhat[1001:length(yhat)] + quantile(noise,0.05, na.rm=T)
upr = yhat[1001:length(yhat)] + quantile(noise,0.95, na.rm=T)

predictions = data.frame(real=y[1001:length(y)], fit=yhat[1001:length(yhat)], lwr=lwr, upr=upr)

predictions = predictions %>% mutate(out=factor(if_else(real<lwr | real>upr,1,0)))

# how many real observations are out of the intervals?
mean(predictions$out==1)
## [1] 0.09913132

For the coverage evaluation, we use the remaining cars in the testing set (those not used to estimate the noise).

In the plot we can see the observations that are out of the intervals.

ggplot(predictions, aes(x=fit, y=real))+
  geom_point(aes(color=out)) + theme(legend.position="none") +
  xlim(0, 20000) + ylim(0, 20000)+
  geom_ribbon(data=predictions,aes(ymin=lwr,ymax=upr),alpha=0.3) +
  labs(title = "Prediction intervals", x = "prediction",y="real price")

Conclusion

During this analysis, we started by exploring the cars data set in search of relations and trends across the years. We later focused on 21st-century cars for a better comparison and understanding of their prices and pollution.

We have been able to see relations involving the colors of the cars and the different categories, models and manufacturers, as well as how the fuel types affect pollution, the relation between cylinders and Levy, and the influence of engine volume. However, we have mainly focused on classification and prediction.

After evaluating different approaches, we have built a model that classifies a car’s contribution to pollution (“HIGH”, “MEDIUM” or “LOW”) with very good accuracy thanks to random forest. Decision trees, although more dependent on a particular set of features, are also helpful for interpreting the model in an easier and faster way. The high accuracy is partly explained by the fact that pollution is a variable we created at the beginning from other variables, also taking CO2 emissions into account.

For regression, we have seen that the variables, in general, did not explain much of the variation in Price. Across the different models, the accuracy and the r-squared values were poor, but it is remarkable how random forest was able to increase them. Although we did not reach a great r-squared, we improved it substantially, which shows how well random forest performs and how much more accurate the models it provides can be.

After evaluating models for both classification and regression, we conclude that random forest is the best model. Taking all the evaluation tools into account, it has been by far the one with the best performance. This is largely because our database is non-linear and has very low correlations; random forest is one of the best tools for this purpose, since it is precise and generalizes over the data very well. Even with many levels to choose from, it achieves a reasonable accuracy.